3.3 Policy Training
When building an efficient Computer Control Agent, no single type of agent is sufficient on its own. To balance generalization ability with task execution efficiency, this architecture adopts a Multi-Agentic Architecture, combining the strengths of Foundation Agents and Specialized Agents to improve both adaptability and execution efficiency.
Foundation Agents rely on pre-trained foundation models (LLMs, VLMs) as their policy π, offering several advantages:
Broad task adaptability: By leveraging the contextual learning ability of foundation models, they can handle a variety of tasks across different environments.
No need for extensive task-specific training: They can generate interaction actions (e.g., `click(id=search)`) directly from natural language instructions, making them suitable for dynamically changing environments (a minimal sketch appears below).
Support for multi-turn interaction optimization: Through contextual adjustments and adaptive learning, they can improve task execution success rates.
However, they also have drawbacks:
Potential inefficiency: Since Foundation Agents rely on large model inference, they may lack task-specific optimizations, leading to slower response times.
Need for task-specific tuning: They require additional adjustments to ensure execution accuracy.
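To make this concrete, the following is a minimal sketch of how a Foundation Agent might turn a natural-language instruction and the current observation into an action string such as `click(id=search)`. The `llm` callable, the prompt template, and the action format are illustrative assumptions rather than a specific implementation.

```python
from typing import Callable

# Hypothetical prompt template; the action grammar is an assumption for illustration.
ACTION_PROMPT = """You are a computer control agent.
Instruction: {instruction}
Observation (accessibility tree): {observation}
Reply with exactly one action, e.g. click(id=search) or type(id=query, text="laptops").
"""

def foundation_agent_step(
    llm: Callable[[str], str],   # any pretrained LLM/VLM wrapped as prompt -> completion
    instruction: str,
    observation: str,
) -> str:
    """Generate one interaction action from natural language, with no task-specific training."""
    prompt = ACTION_PROMPT.format(instruction=instruction, observation=observation)
    return llm(prompt).strip()   # e.g. "click(id=search)"

# Usage (with any LLM wrapper):
# action = foundation_agent_step(my_llm, "Search for laptops", serialized_ui_tree)
```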
Specialized Agents depend on specially designed deep learning architectures as their policy π, offering the following benefits:
Optimized for specific tasks: For instance, directly predicting UI operations (e.g., clicking at coordinates (x, y), sketched below) to execute fixed tasks more efficiently.
Reinforcement learning enhancement: Continuously optimizing strategies through interaction with the environment, improving task completion accuracy and efficiency.
Enables efficient execution: Bypassing the inference process of foundation models, these agents generate specific execution actions directly based on observation information, reducing computational resource consumption.
However, they also have limitations:
Weaker generalization ability: they struggle to adapt to unseen tasks, so a separate model must be trained for each task.
Higher demand for annotated task data and for reinforcement-learning optimization.
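By contrast with the Foundation Agent above, a Specialized Agent's policy can be a compact task-specific model. The sketch below shows one illustrative (assumed) design: a small convolutional network that maps a screenshot directly to normalized click coordinates (x, y), skipping foundation-model inference entirely.

```python
import torch
import torch.nn as nn

class ClickPolicy(nn.Module):
    """Specialized policy: screenshot tensor -> normalized click coordinates (x, y)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)   # predict (x, y) in [0, 1]

    def forward(self, screenshot: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(self.encoder(screenshot)))

# Usage: coords = ClickPolicy()(torch.rand(1, 3, 224, 224))  # tensor of shape (1, 2)
```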
Hybrid Agent Strategy
To fully leverage the strengths of both approaches, this architecture employs a hybrid strategy, dynamically selecting the appropriate agent for task execution in different scenarios:
Basic task parsing: Foundation Agents handle user natural language instructions and generate an initial task plan.
Task execution optimization: If the task involves high-efficiency interactions (e.g., clicking, dragging, form filling), Specialized Agents are invoked for precise execution.
Adaptive environmental adjustments: In new environments, Foundation Agents first explore and determine the optimal execution strategy, after which Specialized Agents refine the operation path for better efficiency.
This hybrid approach ensures both adaptability and efficiency, enabling strong generalization in open environments while delivering high-performance execution for specific tasks.
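One simple way to realize this hybrid strategy is a dispatcher that routes each planned step to the appropriate agent. The agent interface, skill set, and routing rule below are assumptions for illustration.

```python
from typing import Protocol

class Agent(Protocol):
    def act(self, instruction: str, observation: str) -> str: ...

# Hypothetical set of interaction types the Specialized Agent was trained on.
SPECIALIZED_SKILLS = {"click", "drag", "fill_form"}

def hybrid_step(foundation: Agent, specialized: Agent,
                instruction: str, observation: str, planned_step: str) -> str:
    """Route a planned step to the most suitable agent."""
    skill = planned_step.split("(", 1)[0]                 # e.g. "click" from "click(id=search)"
    if skill in SPECIALIZED_SKILLS:
        return specialized.act(instruction, observation)  # fast, task-specific execution
    return foundation.act(instruction, observation)       # general reasoning fallback
```

In practice the routing rule could itself be learned, but even a static skill whitelist captures the core idea: cheap, precise execution where a Specialized Agent exists, and a Foundation Agent fallback everywhere else.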
In the decision-making process of a Computer Control Agent, the policy is the core mechanism that determines the agent’s next action. Different policies are suited to varying levels of task complexity and execution requirements. This architecture adopts a Mixed Policies approach, combining Memoryless Policies, History-based Policies, and State-based Policies to strike a balance between generalization capability and execution efficiency.
Memoryless Policies rely solely on the current observation o_t to predict the next action a_t = π(o_t), without considering past observations or prior actions (see the sketch below).
Advantages:
High computational efficiency: No need to store or process historical information, making it suitable for high-concurrency tasks.
Disadvantages:
Limited ability to handle complex tasks: Not well-suited for multi-step tasks or applications involving cross-page interactions.
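A memoryless policy is just a pure function of the current observation, a_t = π(o_t). The sketch below only makes that interface explicit; the observation/action encodings and the toy rule are assumptions.

```python
from typing import Callable

Observation = str   # e.g. a serialized accessibility tree
Action = str        # e.g. "click(id=search)"

# A memoryless policy is any function that looks only at the current observation.
MemorylessPolicy = Callable[[Observation], Action]

def example_policy(obs: Observation) -> Action:
    """Toy rule-based pi(o_t): no history, no internal state."""
    return "click(id=search)" if "search" in obs else "scroll(down)"
```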
History-based Policies accumulate an observation-action history h_t = (o_1, a_1, …, o_t), using past observations and decisions to guide the next action a_t = π(h_t) (see the sketch below).
Advantages:
Enhanced task continuity: Maintains context in multi-step tasks, improving operational accuracy.
Disadvantages:
Higher computational cost: As historical data accumulates, computational complexity increases, requiring optimized storage and processing strategies.
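A history-based policy conditions on the accumulated trajectory h_t. The sketch below keeps a bounded (observation, action) buffer; the buffer length and the placeholder decision rule are assumptions.

```python
from typing import List, Tuple

class HistoryPolicy:
    """pi(h_t): decide from the accumulated (observation, action) history."""

    def __init__(self, max_len: int = 20):
        self.history: List[Tuple[str, str]] = []
        self.max_len = max_len      # bound memory and compute as the history grows

    def act(self, observation: str) -> str:
        context = self.history[-self.max_len:] + [(observation, "")]
        action = self._decide(context)               # e.g. feed the context to a model
        self.history.append((observation, action))
        return action

    def _decide(self, context: List[Tuple[str, str]]) -> str:
        # Placeholder rule; a real agent would score candidate actions with a model.
        return "click(id=next)" if len(context) > 1 else "click(id=search)"
```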
State-based Policies maintain an internal state s_t, updated by a state-update function s_t = f(s_{t-1}, o_t), tracking environmental changes to inform the decision a_t = π(s_t) (see the sketch below).
Advantages:
Strong environmental adaptability: Retains long-term task state information, improving performance on complex tasks.
Disadvantages:
Higher computational resource demand: State updates often rely on recurrent neural networks (RNNs) or Transformers, leading to significant computational costs.
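A state-based policy compresses the trajectory into a fixed-size internal state s_t = f(s_{t-1}, o_t), here sketched with a GRU cell; the dimensions and the discrete action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StatePolicy(nn.Module):
    """Maintains an internal state s_t = f(s_{t-1}, o_t) and picks an action id via pi(s_t)."""

    def __init__(self, obs_dim: int = 128, state_dim: int = 64, n_actions: int = 10):
        super().__init__()
        self.update = nn.GRUCell(obs_dim, state_dim)   # state-update function f
        self.head = nn.Linear(state_dim, n_actions)    # pi(s_t) over a discrete action set
        self.state = torch.zeros(1, state_dim)         # initial state s_0

    @torch.no_grad()
    def act(self, obs: torch.Tensor) -> int:
        self.state = self.update(obs, self.state)      # s_t = f(s_{t-1}, o_t)
        return int(self.head(self.state).argmax(dim=-1))

# Usage: policy = StatePolicy(); action_id = policy.act(torch.rand(1, 128))
```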
This approach combines History-based and State-based policies, leveraging past actions, current observations, and external textual states to enhance task continuity and generalization. The system dynamically selects a Memoryless, History-based, or State-based policy depending on task complexity (a selection sketch follows below). For high-complexity tasks such as e-commerce recommendation, automated trading, and intelligent data entry, Mixed Policies provide more intelligent and stable operation.
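In its simplest form, the mixed strategy can start from a selector that picks a policy family per task from coarse complexity features; the feature names and thresholds below are assumptions for illustration.

```python
def select_policy_family(task: dict) -> str:
    """Choose a policy family from coarse task features (illustrative heuristics)."""
    if task.get("steps", 1) == 1:
        return "memoryless"     # a_t = pi(o_t): cheapest, fine for single-step actions
    if task.get("steps", 1) <= 5 and not task.get("long_horizon", False):
        return "history-based"  # a_t = pi(h_t): keeps context across a few steps
    return "state-based"        # a_t = pi(s_t): compressed state for long, complex tasks

# Usage:
# select_policy_family({"steps": 1})                         -> "memoryless"
# select_policy_family({"steps": 12, "long_horizon": True})  -> "state-based"
```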