3.2 Dataset

3.2.1 Data Collection Definition

To train and optimize the Computer Control Agent, the following data needs to be collected:

User Interaction Data: Records user actions in browsers or applications, such as clicks, text inputs, and scrolling. For example, when a user fills out a form on a webpage, the sequence of text entries and the action of clicking the "Submit" button are recorded.
Environment State Data: Includes screenshots, UI structures, webpage DOM trees, and window layouts in various task scenarios. For instance, in a browser environment, the structure and positions of elements (buttons, text fields, links, etc.) are recorded; in desktop and mobile environments, the layout of windows and controls is captured.
Task Outcome Data: Measures key performance indicators such as task completion accuracy and execution time across different strategies. For example, in an automated login task, it tracks whether the login succeeded, how long it took, and any errors encountered.
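
As a rough illustration of the three categories above, the following Python sketch models each one as a small record type. All class and field names here are hypothetical and chosen for readability; they are not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserInteraction:
    """One recorded user action (hypothetical shape)."""
    kind: str                     # e.g. "click", "text_input", "scroll"
    target: str                   # label of the element acted on
    value: Optional[str] = None   # typed text, if the action carries any


@dataclass
class EnvironmentState:
    """What could be observed at one step (hypothetical shape)."""
    screenshot_path: str
    ui_structure_path: str        # DOM tree or window/control layout dump


@dataclass
class TaskOutcome:
    """Result of one task run (hypothetical shape)."""
    success: bool
    duration_seconds: float
    errors: List[str] = field(default_factory=list)


# The form example from the text: text entries followed by a click on "Submit".
form_session = [
    UserInteraction(kind="text_input", target="Email", value="user@example.com"),
    UserInteraction(kind="click", target="Submit"),
]
outcome = TaskOutcome(success=True, duration_seconds=4.2)
```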

Dataset Comparison & Structure Design

We analyzed existing computer-use datasets, including Mind2Web, AITW, WebShop, and Rico. Based on this comparison, we designed a unified dataset structure that adapts to different contexts while providing detailed, actionable data. The structure suits PC, Web, and mobile environments.

Task and Subtask Representation
1. The definition of tasks and subtasks (episodes) is platform-agnostic.
2. Whether the agent interacts with websites, desktop applications, or mobile apps, tasks decompose into actionable steps in the same way.
Element Identifiers
1. The object, object_struct_path, and object_struct_type fields store platform-specific identifiers.
2. For example, in a mobile app an element’s ID could be a resource identifier, whereas in a desktop application it could be a window handle or GUI control name.
3. Similarly, object_struct_type adapts to different structures like DOM for Web, WinAppDriver for PC, or UIAutomator for mobile.
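
To make the identifier fields concrete, here is a minimal Python sketch that builds the three fields for the same logical element on each platform. The helper name `make_element_ref` and every locator string are invented for illustration; only the field names and the three structure types come from the text above.

```python
# Structure types named in the text.
KNOWN_STRUCT_TYPES = {"DOM", "WinAppDriver", "UIAutomator"}


def make_element_ref(obj: str, struct_path: str, struct_type: str) -> dict:
    """Return the three identifier fields for one UI element.

    Hypothetical helper: rejects structure types the text does not mention.
    """
    if struct_type not in KNOWN_STRUCT_TYPES:
        raise ValueError(f"unknown object_struct_type: {struct_type!r}")
    return {
        "object": obj,
        "object_struct_path": struct_path,
        "object_struct_type": struct_type,
    }


# Same logical element, three platform-specific locators (all made up).
web = make_element_ref("Search box", "#search-input", "DOM")
desktop = make_element_ref("Search box", "MainWindow/SearchEdit", "WinAppDriver")
mobile = make_element_ref("Search box", "com.example.shop:id/search", "UIAutomator")
```

The point of the sketch is that `object` stays the same across platforms while the path and type change.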
Actions and Behaviors
1. The action and behavior fields describe operations at a high level of task-oriented abstraction.
2. This abstraction enables mapping the same task (e.g., opening the homepage, searching for an item) to platform-specific implementations.
3. For example, clicking a button in a mobile app may involve a touch event, while in a desktop app it could require a mouse click or keyboard shortcut.
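
One way to picture this mapping is a lookup from an abstract action plus a platform to a low-level input event. The sketch below is an assumption for illustration only: the platform labels, event names, and the `resolve_event` function are not part of the dataset definition.

```python
# Hypothetical table: (abstract action, platform) -> low-level input event.
PLATFORM_EVENTS = {
    ("click", "web"): "mouse_click",
    ("click", "desktop"): "mouse_click",
    ("click", "mobile"): "touch_tap",
}


def resolve_event(action: str, platform: str) -> str:
    """Look up the platform-specific event for an abstract action."""
    try:
        return PLATFORM_EVENTS[(action, platform)]
    except KeyError:
        raise NotImplementedError(
            f"no mapping for {action!r} on {platform!r}"
        ) from None


mobile_event = resolve_event("click", "mobile")
desktop_event = resolve_event("click", "desktop")
```

The dataset record keeps only the abstract action; a table like this would live in the executor for each platform.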
Environment Understanding
1. The environment_understanding field captures context-aware data across different platforms.
2. On mobile devices, VLMs (Vision-Language Models) may detect screen elements differently than on desktops, but the overall structure for capturing and integrating this information (parsers, VLMs, LLMs, fusion mechanisms) remains consistent.
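
The fusion step can be sketched as a function that keeps each source's raw output and also builds a merged view. The key names follow the labeled record shown later in this section. The rule that each source feeds exactly one merged section is an assumption made for this sketch, not the documented fusion mechanism, and `merge_understanding` is a hypothetical name.

```python
def merge_understanding(parser_out: dict, vlm_out: dict, llm_out: dict) -> dict:
    """Combine parser, VLM, and LLM results into one structure.

    Assumption: the parser feeds dom_elements, the VLM feeds visual_objects,
    and the LLM feeds text_semantics. A real fusion step may be more involved.
    """
    return {
        "parser": parser_out,
        "vlm": vlm_out,
        "llm": llm_out,
        "merged": {
            "comprehensive_structure": {
                "dom_elements": parser_out.get("object_hierarchy", []),
                "visual_objects": vlm_out.get("object_detection", []),
                "text_semantics": llm_out.get("semantic_analysis", {}),
            }
        },
    }


env = merge_understanding(
    parser_out={"object_hierarchy": ["navigation_bar", "search_bar"]},
    vlm_out={"object_detection": ["logo"]},
    llm_out={"semantic_analysis": {"key_text": "Sign in, Cart, Search"}},
)
```

Because the function only depends on the three source outputs, the same code path would apply whether those outputs came from a web, desktop, or mobile screen.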
State & Completion
1. The state field represents the overall task completion status and the reason for it, serving as a platform-independent progress indicator.

Based on these considerations, the labeled dataset fields are as follows:

{
  "task": "Purchase the cheapest 50-inch TV from Amazon",
  "episodes": [
    {
      "episode": 1,
      "subtask": "Open the Amazon website",
      "action": "Open the Amazon homepage",
      "object": "Amazon homepage",
      "object_struct_path": "#homepage",
      "object_struct_type": "DOM",
      "object_struct_content": "local_path/struct_content/episode1_amazon_home_struct.json",
      "behavior": "Enter the Amazon website URL in the browser and visit",
      "status": "completed",
      "execution_time": "2024-12-01T10:00:00",
      "current_time": "2024-12-01T10:00:00",
      "screenshot": "local_path/screenshots/episode1_amazon_home.png",
      "environment_understanding": {
        "parser": {
          "object_hierarchy": [...]
        },
        "vlm": {
          "object_detection": [...]
        },
        "llm": {
          "semantic_analysis": {...}
        },
        "merged": {
          "comprehensive_structure": {
            "dom_elements": {
              "navigation_bar": "Top section",
              "search_bar": "Center top",
              "logo": "Top-left"
            },
            "text_semantics": {
              "key_text": "Sign in, Cart, Search"
            },
            "visual_objects": {
              "logo_detected": true,
              "menu_items": ["Electronics", "Computers", "TVs"]
            }
          }
        }
      },
      "state": {
        "overall_task_completion": false,
        "reason": "..."
      }
    },
    ...
  ]
}
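
A record in this shape can be checked mechanically before it enters a training set. The following Python sketch reports missing fields per episode. Which fields count as required is an assumption made for this example, and `validate_record` is a hypothetical helper, not part of any published tooling.

```python
import json

# Assumption: these episode fields are treated as mandatory in this sketch.
REQUIRED_EPISODE_FIELDS = {
    "episode", "subtask", "action", "object", "object_struct_path",
    "object_struct_type", "behavior", "status", "screenshot",
    "environment_understanding", "state",
}


def validate_record(record: dict) -> list:
    """Return human-readable problems found in one labeled record."""
    problems = []
    if "task" not in record:
        problems.append("record: missing 'task'")
    for i, episode in enumerate(record.get("episodes", [])):
        for name in sorted(REQUIRED_EPISODE_FIELDS - set(episode)):
            problems.append(f"episodes[{i}]: missing '{name}'")
    return problems


# A deliberately incomplete record, to show what the checker reports.
record = json.loads("""
{
  "task": "Purchase the cheapest 50-inch TV from Amazon",
  "episodes": [
    {"episode": 1, "subtask": "Open the Amazon website", "status": "completed"}
  ]
}
""")
problems = validate_record(record)
```

Here `problems` lists one entry for each required field absent from the first episode, such as `action` and `screenshot`.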

3.2.2 Relying on Codatta for High-Quality Data

R6D9’s Computer Control Framework Relies on High-Quality Data

The R6D9 computer control framework requires high-quality data, but such data is not readily available. Raw data must go through precise collection, professional annotation, and rigorous validation before it forms a valuable knowledge layer. Traditional data collection methods fail to meet these requirements because of several challenges:

Lack of suitable data contributors – High-quality data requires knowledgeable contributors, but few individuals in the market are willing to participate.
Unclear data ownership and revenue distribution – Contributors struggle to benefit from the long-term value of their data, leading to a lack of motivation.
Poor data liquidity – Even when high-value data is generated, efficient circulation mechanisms are lacking, making it difficult to maximize commercial value.

To continuously optimize R6D9, a stable knowledge production and assetification mechanism is needed. To achieve this, we have partnered with Codatta, a leading Web3 data infrastructure provider, to build R6D9’s data workflow.

Codatta: A Knowledge Data Production & Assetification Platform

Codatta provides R6D9 with a complete solution for data acquisition, annotation, validation, revenue distribution, and assetification, solving the core challenges of knowledge data production. It offers three core capabilities to support R6D9:

FaaS (Frontier as a Service) Data Contribution Mechanism
- Codatta’s FaaS model enables R6D9 to rapidly launch and manage vertical data collection tasks while attracting data contributors. Through this mechanism, R6D9 can:
  - Define task rules – Clearly specify data collection and annotation standards, such as environment recognition labeling and robot behavior feedback.
  - Incentivize contributors – Contributors earn rewards based on data quality, increasing participation.
  - Automate data validation – Leverage Codatta’s reputation system and AI quality control to ensure data reliability.
Data Contributor Revenue Model (Royalty Model)
- Codatta employs a data royalty model, ensuring that contributors not only receive one-time payments but also benefit from their data’s long-term value. This model includes:
  - Contributor data ownership – When contributors upload data, the system automatically assigns ownership based on contribution levels and records it on-chain, securing their rights.
  - Ongoing revenue distribution – Contributors can continuously earn from their data through multiple revenue channels.
  - Asset liquidity – If contributors prefer an instant payout, they can transfer or sell their data ownership for immediate earnings.
- This model ensures short-term profits while creating opportunities for long-term data asset appreciation, attracting more high-quality contributors to participate in data production.
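
To illustrate ownership "based on contribution levels", the sketch below splits a revenue amount in proportion to each contributor's weight. This is a generic proportional split written for illustration, not Codatta's actual on-chain mechanism; the function name, the integer weights, and the tie-breaking rule are all assumptions.

```python
def split_revenue(revenue_units: int, contributions: dict) -> dict:
    """Split an integer amount in proportion to integer contribution weights.

    Hypothetical rule: floor each share, then hand the leftover units to the
    largest contributors first (ties broken by name) so the total is preserved.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    shares = {who: revenue_units * w // total for who, w in contributions.items()}
    leftover = revenue_units - sum(shares.values())
    ranked = sorted(contributions, key=lambda who: (-contributions[who], who))
    for who in ranked[:leftover]:
        shares[who] += 1
    return shares


even_split = split_revenue(100, {"alice": 3, "bob": 1})
uneven_split = split_revenue(10, {"a": 1, "b": 1, "c": 1})
```

Working in integer units avoids floating-point rounding, so the shares always add up to the amount distributed.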
Knowledge Data Assetification
- The value of knowledge data lies not only in production but also in circulation and monetization. Codatta leverages Web3 technology to assetify R6D9’s high-quality data, enhancing market liquidity through:
  - On-chain data ownership certification – Blockchain technology guarantees transparent and immutable data ownership.
  - Privacy and security protection – Privacy-preserving and compliance mechanisms ensure secure data usage.
  - Enhanced liquidity – Using Web3 mechanisms such as DeFi, data assets can be traded, leased, or collateralized, expanding monetization opportunities.
- By assetifying data, Codatta lowers the cost of data circulation and makes it easier for contributors to generate income. This enhances data market activity, fostering the sustainable growth of the R6D9 data ecosystem.

