Technical Architecture
Figures 2.1 and 2.2 illustrate examples of agent execution in a browser and a mobile app, respectively. Although these application scenarios interact with entirely different environments, they can be abstracted into a unified Computer Control Agent model, as shown in Figure 2.3. Given a user input command $u$, the agent must interact with the environment to complete the corresponding task. This process can be broken down into the following stages:
Environmental Perception: At any time $t$, the environment is in a state $s_t$, but the agent cannot fully observe $s_t$. Instead, it receives a partial observation $o_t$ (where $\mathcal{O}$ denotes the observation space). For example, $o_t$ could be a screenshot of the current browser window, while $s_t$ includes the full state of the running browser.
Agent Input Preparation: The observation $o_t$ is often preprocessed before being fed into the agent, transforming it into $x_t$. This may involve cropping a screenshot or extracting a DOM tree from HTML.
Agent Action Planning: Based on the processed input $x_t$ and the user command $u$, the agent computes the next abstract action $\hat{a}_t$ using a policy $\pi$. The policy is crucial for task accuracy and serves as the core component for algorithm optimization.
Action Grounding: The abstract action $\hat{a}_t$ is converted into an executable command $a_t$. For example, a high-level instruction like "click the submit button" is translated into a concrete command such as "click(x, y)", where $(x, y)$ represents the button's coordinates.
This process repeats in a loop until a termination condition is met, such as completing the instruction or reaching the maximum step limit.
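Put together, these stages form an observe-plan-act loop. The following Python sketch is purely illustrative: every function, type, and parameter name is a hypothetical placeholder for the stage it labels, not part of any existing framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AbstractAction:
    kind: str                     # e.g. "click", "type", or "stop"
    target: Optional[str] = None  # e.g. "the submit button"

def run_agent(
    observe: Callable[[], bytes],                   # Environmental Perception: produces o_t
    preprocess: Callable[[bytes], str],             # Agent Input Preparation: o_t -> x_t
    policy: Callable[[str, str], AbstractAction],   # Agent Action Planning: (x_t, u) -> abstract action
    ground: Callable[[AbstractAction, bytes], str], # Action Grounding: abstract action -> a_t
    execute: Callable[[str], None],
    command: str,
    max_steps: int = 50,
) -> None:
    for _ in range(max_steps):        # terminate at the maximum step limit
        o_t = observe()
        x_t = preprocess(o_t)
        a_hat = policy(x_t, command)
        if a_hat.kind == "stop":      # terminate once the instruction is completed
            return
        a_t = ground(a_hat, o_t)      # e.g. "click the submit button" -> "click(x, y)"
        execute(a_t)
```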
To ensure the smooth operation of this unified Computer Control Agent model, a well-designed technical architecture is essential. Figure 2.4 illustrates the proposed architecture, which includes the following key components:
Environment Recognition and Operation: The primary task of the Agent is to perceive and understand the operational environment to enable effective interaction. This process includes environment observation and interaction execution.
Environment Observation: In complex computational environments, the Agent must analyze different operating contexts, such as web environments, desktop applications, or mobile applications. Since each environment has unique data structures and interaction patterns, the Agent employs a multi-layered perception strategy:
In web environments, the Agent directly parses the DOM structure to extract interactive elements such as buttons, input fields, and links (a minimal sketch of this path follows the list).
In desktop/mobile environments, the Agent relies on UI component recognition, OCR analysis, and window management to extract key information from the active interface.
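For the web path, one common lightweight approach is to walk the parsed DOM and collect candidate controls. The sketch below assumes the BeautifulSoup library; the tag list and returned fields are illustrative choices rather than a fixed schema.

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is available

def extract_interactive_elements(html: str) -> list[dict]:
    """Collect candidate interactive elements (buttons, inputs, links) from a page's DOM."""
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for tag in soup.find_all(["button", "input", "a", "select", "textarea"]):
        elements.append({
            "tag": tag.name,
            "text": tag.get_text(strip=True),
            "id": tag.get("id"),      # None when the attribute is absent
            "type": tag.get("type"),
            "href": tag.get("href"),
        })
    return elements
```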
Interaction Execution: After understanding the environment, the Agent must execute appropriate interactions to complete tasks. The primary interaction methods include:
API-based precise control: When an environment provides standardized APIs (e.g., DevTools Protocol, Windows UI Automation), the Agent can send direct commands for precise execution.
UI-based intelligent control: In cases where direct API access is unavailable, the Agent leverages computer vision, event listeners, and automation techniques to simulate interactions such as mouse clicks and keyboard inputs. Both control paths are sketched below.
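As a concrete illustration, the sketch below prefers API-based control through Playwright (which drives Chromium over the DevTools Protocol) and falls back to a simulated pyautogui click when the structural handle fails. The libraries, the CSS selector, and the fallback coordinates are all assumptions made for the example; in practice the coordinates would come from a computer vision or OCR module.

```python
from playwright.sync_api import sync_playwright  # assumption: playwright is installed
import pyautogui                                 # assumption: pyautogui is installed

def click_ui(x: int, y: int) -> None:
    """UI-based control: simulate a mouse click at screen coordinates."""
    pyautogui.click(x, y)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # visible window so UI fallback is meaningful
    page = browser.new_page()
    page.goto("https://example.com")
    try:
        # API-based precise control: address the element structurally.
        page.click("button#submit")  # hypothetical selector
    except Exception:
        # Fallback: coordinates supplied by vision/OCR (illustrative values here).
        click_ui(640, 400)
    browser.close()
```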
The accuracy of environment recognition and operation directly determines whether the Agent can perform tasks stably and efficiently. The system architecture must therefore provide adaptive environment modeling, enabling the Agent to adjust its perception methods and interaction strategies to different application scenarios.
Data Sourcing and Labeling Pipeline: A highly efficient data sourcing and labeling pipeline is crucial for enhancing the Agent's ability to understand and execute tasks. The dataset consists not only of static task descriptions but also of real-world interaction records, which support continuous model optimization. To ensure strong generalization capabilities, the Agent requires diverse data sources (combined into an illustrative record schema after this list):
User Interaction Data: Logs of real user actions such as clicks, inputs, and scrolling in browsers or applications, providing executable task execution paths for learning.
Environment State Data: Stores UI structures, web DOM trees, and window layouts across different task scenarios, helping the Agent build task environment models.
Task Outcome Data: Captures key performance metrics, such as task completion accuracy and execution time, enabling continuous optimization of task planning strategies.
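One way to bind these three sources together is a single per-step record; the schema below is an illustrative assumption, not a standardized format.

```python
from dataclasses import dataclass

@dataclass
class InteractionRecord:
    # User Interaction Data
    action: str          # e.g. "click", "type", "scroll"
    target: str          # element identifier or screen coordinates
    # Environment State Data
    dom_snapshot: str    # serialized DOM tree or UI hierarchy at this step
    window_layout: dict  # active windows and screen geometry
    # Task Outcome Data
    task_id: str
    success: bool        # task completion signal for this trajectory
    duration_ms: int     # execution time metric
```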
Actions Planning Model: The Actions Planning Model is the core of task execution. It translates natural language instructions into executable action sequences while optimizing execution paths.
Task Parsing and Planning: Upon receiving a command, the Agent must decompose it into subtasks and determine the optimal execution strategy. This process includes:
Instruction Parsing: Extracting key tasks from user input and mapping them to executable actions. For instance, the command “Download a research paper and save it to Google Drive” can be broken down into: 1) Access the research paper website. 2) Identify and click the download button. 3) Open Google Drive and upload the file.
Hierarchical Task Planning: Implementing a layered planning strategy to break down complex tasks into smaller, executable subtasks, ensuring an optimized execution sequence (a minimal planner sketch follows this list).
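To make the layered strategy concrete, the sketch below hard-codes the example decomposition from above as a small task tree and flattens it into an ordered step sequence; a real planner would generate the tree itself, for example with an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    subtasks: list["Task"] = field(default_factory=list)

# The example command decomposed into the three subtasks listed above.
plan = Task(
    "Download a research paper and save it to Google Drive",
    subtasks=[
        Task("Access the research paper website"),
        Task("Identify and click the download button"),
        Task("Open Google Drive and upload the file"),
    ],
)

def execution_order(task: Task) -> list[str]:
    """Flatten the task hierarchy into an ordered sequence of executable steps."""
    if not task.subtasks:
        return [task.description]
    steps: list[str] = []
    for sub in task.subtasks:
        steps.extend(execution_order(sub))
    return steps

print(execution_order(plan))
```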
Execution Strategy Optimization: During execution, the Agent must dynamically adjust its strategy based on the current environment to improve task success rates and efficiency. This includes:
Task Execution Path Optimization: Leveraging historical task data to predict the most efficient execution path, minimizing redundant operations.
Adaptive Error Recovery: When interactions fail or environments change, the Agent must automatically adjust its execution methods, such as switching strategies or replanning the task sequence (sketched below).
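A minimal sketch of this recovery behavior, assuming hypothetical `execute_step` and `replan` callables: the current step is retried (the attempt index lets the executor switch strategies), and the remaining sequence is replanned if every attempt fails.

```python
import time
from typing import Callable

def run_with_recovery(
    steps: list[str],
    execute_step: Callable[[str, int], None],  # (step, attempt) -> raises on failure
    replan: Callable[[list[str]], list[str]],  # failed tail -> new step sequence
    max_retries: int = 2,
) -> None:
    i = 0
    while i < len(steps):
        for attempt in range(max_retries + 1):
            try:
                execute_step(steps[i], attempt)  # attempt index allows strategy switching
                break
            except Exception:
                time.sleep(1.0)                  # wait out transient UI/environment changes
        else:
            steps = replan(steps[i:])            # every retry failed: replan the remaining tasks
            i = 0
            continue
        i += 1
```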
A highly efficient Actions Planning Model ensures that the Agent completes user tasks with minimal steps while maintaining robustness and adaptability in dynamic environments.