UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma

TL;DR
UItron is a foundational GUI agent model that integrates advanced perception and planning, leveraging systemic data engineering and interactive infrastructure to improve performance in GUI tasks, especially in Chinese app scenarios.
Contribution
The paper introduces UItron, a novel open-source GUI agent model with advanced perception, grounding, and planning capabilities, and demonstrates its effectiveness through comprehensive data strategies and interactive environments.
Findings
UItron outperforms existing models in GUI perception, grounding, and planning benchmarks.
It achieves significant progress in Chinese mobile app scenarios.
The model demonstrates strong interaction proficiency with top-tier Chinese apps.
Abstract
GUI agent aims to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data…
| Model | OSWorld |
| Compute-Use Agent (CUA) | |
| OpenAI CUA | 26.0 |
| Claude CUA | 31.2 |
| OpenCUA-32B | 29.7 |
| GUI Agent | |
| Qwen2.5-VL-72B | 4.4 |
| Augvis-72B | 10.3 |
| UI-TARS-7B | 18.7 |
| UI-TARS-72B | 22.7 |
| UI-TARS-1.5-7B∗ | 23.3 |
| UItron-72B | 24.9 |
| Method | Step SR | Task SR |
| Offline Environment | ||
| UI-TARS-7B | 75.1 | 22.4 |
| UI-TARS-72B | 80.5 | 32.8 |
| UI-TARS-1.5-7B | 77.4 | 29.3 |
| UItron-7B | 82.7 | 40.5 |
| UItron-72B | 84.1 | 47.4 |
| Method | Task SR |
| Online Environment | |
| UI-TARS-1.5-7B | 38.9 |
| UItron-7B | 44.4 |
| UItron-72B | 54.1 |
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. This paper presents thorough research and engineering efforts in data engineering and unified interactive infrastructure, and demonstrates their effectiveness in developing GUI agents/models. These empirical exploration and findings are potentially valuable for the future development of GUI models. 2. The open-sourced UItron models are beneficial to both research community and the industry as the fundation model for subsequent research and development of GUI agents.
1. While the paper claims that UItron demonstrates superior performance on perceptual and grounding tasks, judging from the results of UItron-7B vs. UI-TARS-7B in table 2 and 3, it seems to suggest that UItron only demonstrates comparable or slightly worse performance compared with UI-TARS at 7B scale. 2. One of the main contribution of UItron is its leading performance in Chinese mobile app scenarios, however, given that it has put a lot of data engineering efforts specifically into Chinese mob
Originality: - This integration of multi-stage training and data engineering represents a novel and comprehensive approach to GUI agent development. Quality: - The paper demonstrates strong systematic data engineering investigation, multi-model reward validation, and extensive benchmarking. - The experiments are coherent with the methods, demonstrating the performance of each training stage. Clarity: - The paper is clear and well-organized. It elaborates on data engineering and training stage
- Although the paper contributes to unified interactive infrastructures and data engineering, most methods used in this framework are existing methods (e.g., SFT, GRPO), which might limit its novelty. - Although UItron employs “backtracking” and structured reasoning formats, its internal decision process remains opaque. Visualization or case studies could reveal whether backtracking contributes to actual causal reasoning.
- **Comprehensive 3-Stage Training Paradigm:** The paper's core strength is its well-conceived 3-stage training strategy. It logically builds the agent's capabilities: first, learning to *see* (Perception SFT) , then learning to *plan* (Planning SFT) , and finally, learning to *explore and reason* in a dynamic environment (CuRL). The ablation study provides strong quantitative support for this curriculum. - **Major Dataset Contribution:** The paper's most significant and lasting contribution is
- **Missing Model Architecture Details:** The most significant weakness is the complete omission of the base model architecture. The paper does not state which Large Language Model is used for Ultron-7B and Ultron-72B. If both of them are trained from scratch, the authors also need to detailedly introduce the backbone. - **Vague Details on "Backtracking" Inference:** The concept of "backtracking" is introduced as a key component of Stage 2 SFT . Equation (2) defines it as a training task. Howev
- Comprehensive design: Combines perception, grounding, and planning within a unified framework. - Innovative data pipeline: Effective use of backtracking, trajectory distillation, and multimodal data unification. - Strong empirical results: Outperforms or matches leading models like UI-TARS across diverse benchmarks.
- The paper lacks clear novelty. Most of its contributions lie in data engineering, while there is little innovation on the algorithmic side. However, many details of the data construction and processing are not disclosed. - The experimental evaluation is also insufficient. There are too few baselines—models such as Claude 4, OpenAI o3, and UI-TARS-2 should have been included for comparison. - Moreover, the experiments lack deeper analysis, such as tracking changes in reward, entropy, or other
The paper presents a comprehensive and systematic framework for building foundational GUI agents by combining large-scale data engineering, interactive infrastructure, and curriculum-based reinforcement learning. This integration across perception, grounding, and planning tasks represents a novel formulation compared to prior GUI automation studies that focused on isolated components. The authors design a three-stage training pipeline (Perception SFT → Planning SFT → Curriculum RL) supported by
1. Missing details in the training process. The paper proposes a training framework but lacks many crucial implementation details, which raises concerns about reproducibility and the validity of its claims. - The paper only provides the construction method and scale of the *Distillation Data*, but omits detailed information about the *Perception Data*, *Planning Data*, and *General Multimodal and Manual Annotation Data*. Even if the datasets cannot be released, the authors should provide detail
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Games · Social Robot Interaction and HRI · Explainable Artificial Intelligence (XAI)
††footnotetext: † Equal contribution. ∗ Corresponding author.
UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng*†, Jing Huang†*, Liming Zheng, Wenkang Han,
Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma∗**
Meituan
[email protected], [email protected]
Project: https://github.com/UITron-hub/UItron
Abstract
GUI agent aims to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develop a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top-tier Chinese mobile APPs, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
1 Introduction
GUI agents hong2024cogagent ; zhang2024you ; zhang2024android ; yang-etal-2025-aria ; wuatlas ; gou2025uground ; xu2025aguvis ; qin2025ui ; lin2025showui ; chen2025less aim to automatically execute complex tasks in various digital environments such as PC and Mobile, satisfying the growing expectations of autonomous decision-making and software control in human-computer interaction. These agents decompose the task instructions into multi-step actions by observing the screen status, then navigate and manipulate the on-screen elements following the human-like interactive manners (i.e., click, scroll). This human-like approach provides visually trackable trajectories with step-by-step task execution process, enabling convenient user interaction and explainable decision-making. Therefore, GUI agents have received a rapidly growing amount of attention, becoming an important research topic toward achieving artificial general intelligence.
The pursuit of automated GUI agent has been going on for a long time. Early methods deng2023mind2web ; gur2024real ; lai2024autowebglm utilize optical character recognition and icon detectors to parse GUI environments into textual elements (e.g., HTML and AXTree) as input to the LLM, leveraging its powerful reasoning capabilities to plan and generate multi-step executable actions. The rapid advancement of vision-language models catalyze a series of GUI agents (e.g., yang-etal-2025-aria ; wuatlas ; gou2025uground ; xu2025aguvis ; qin2025ui ) that operate directly on visual GUI images, which achieve superior performance within the framework of unified perception and planning in pure vision. A representative work is UI-TARS qin2025ui , which achieves leading performance via a large amount of data engineering and a carefully designed iterative training framework. Recently, some R1-style works lu2025ui ; liu2025infigui ; zhou2025gui ; tang2025gui ; yang2025zerogui ; dong2025agentic represented by GUI-R1 zhou2025gui designs multimodal reasoning data and explores the typical group relative policy optimization to improve reasoning ability, reporting improved results in grounding benchmarks. To address the limited adaptability in offline environments, ZeroGUI yang2025zerogui and ARPO dong2025agentic propose the online reinforcement learning framework, which adopts VLM-based automatic reward estimation to assess task success and continuously learn from the GUI environments, without hand-crafted evaluation functions.
However, developing automatic GUI agents still remains a highly challenging task due to several limitations: the scarcity of annotated operation trajectories, the availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. As illustrated in Figure 1, training a GUI agent necessitates precise GUI perception, grounding, offline planning, and online planning capabilities, thereby rendering the collection of adequately annotated trajectory data exceedingly difficult. Moreover, in contrast to traditional multimodal tasks, GUI agents face the additional challenge of requiring an interactive environment to execute the model-generated actions that enables multi-round interaction. Additionally, extensive empirical evidence shows that current foundation models typically possess limited performance in GUI scenarios, substantially hindering progress toward effective GUI agent development.
In this paper, we introduce UItron, a powerful open-source foundational model for automatic GUI agents, with powerful GUI perception, task grounding, offline/online planning capabilities. UItron emphasizes the importance of data engineering and interactive infrastructure for developing GUI agents. For data engineering, we significantly expand the available operation trajectories through three key aspects: data unification, trajectory distillation, and manual annotation over different domains. Moreover, we systematically investigated a series of data engineering strategies to enhance training effectiveness, including the utilization of various trajectory elements (e.g., observation, thought and action), the exploration of different reasoning formats, and the incorporation of diverse reflection mechanisms like backtracing. We also find the advantages of integrating multi-task UI-related data and general multimodal data. For interactive infrastructure, we build an interactive environment connecting both Mobile and PC devices. It not only simplifies trajectory data collection by automatically recording screenshots and coordinates, but also provides a foundation for online reinforcement learning (RL) during training.
During training, we employ a three-stage training strategy over several GUI scenarios, which includes GUI perception, planning and RL. Note that the RL stage is specifically designed to enhance complex reasoning and exploration capabilities within online environments. First, UItron adopts a supervised finetuning strategy for GUI perception and planning tasks. The perception task focuses on improving the basic understanding ability of the vision-language model in GUI scenarios, such as grounding, captioning, VQA, and OCR. The planning stage concentrates on forecasting the next action based on historical actions. Then UItron develops a Curriculum Reinforcement Learning (CuRL) framework with group relative policy optimization algorithm on trajectory data. To address the problem of sparse rewards, CuRL first computes dense rewards from the action steps in the offline environment (simple), and then computes the task-level reward for the trajectory data in the online environment (complex). In addition, to improve the credibility of rewards in the RL process, we strictly filter the trajectories that are predicted correctly by multiple scoring models simultaneously.
In particular, UItron emphasizes its ability to interact with top-tier mobile Apps in China, as we find that even state-of-the-art solutions generally underperform in Chinese scenarios for GUI agents. To this end, we meticulously annotate over one million action steps from the top 100 monthly active Apps, covering mainstream interaction scenarios such as social networking, office work, entertainment, and shopping. Based on this, we constructed an offline evaluation dataset to assess the capabilities of GUI agents in Chinese App scenarios, evaluating the performance of different models based on two classical evaluation metrics: single-step success rate and task completion rate. To evaluate the realistic interaction performance of GUI agent in real-world applications, we also build an Android-based cloud real-device environment for online evaluation. Specifically, we develop a rollout method to alternately execute actions and refresh status between GUI agent and Android-based cloud environment. Next, we developed an automated scoring mechanism that leverages multiple VLMs to score the entire task trajectory, and averages the scores to generate evaluation results. Experimental results not only confirm the limitations of existing methods, but more importantly, demonstrate that UItron has achieved substantial progress in Chinese application scenarios, advancing GUI agents toward practical and real-world deployment.
The main contributions of this work are summarized as follows:
- •
We present a systematic investigation of data engineering and interactive infrastructure that effectively supports the development of foundational GUI agents.
- •
We develop a curriculum reinforcement learning framework with dense and credible rewards for trajectory data in GUI agents.
- •
We open-source UItron, achieving superior performance in benchmarks of GUI perception, grounding and offline planning, and competitive results in online agent environments.
- •
We significantly improves the interactive capabilities of UItron in Chinese scenarios through carefully labeled data and tailored online environments.
2 Related Works
2.1 MLLM
Large language models (LLMs) have shown strong generalization capabilities and instruction-following abilities. However, they can only process text information, while real-world applications require models to understand visual information. Thus, Multimodal Large Language Models (MLLMs) such as GPT-4 achiam2023gpt and LLaVA liu2023visual utilize visual encoders and visual projectors to integrate visual data into large language models. MLLM works on visual encoder mainly focus on improving input resolution, which can be roughly divided into direct scaling and patch-division. Direct scaling, e.g., bai2023qwen ; liu2024improved ; hong2024cogagent , inputs higher-resolution images into the encoder, usually requiring further fine-tuning of the encoder or the use of a pre-trained encoder with higher resolution. Patch-division, e.g., lin2023sphinx ; li2024monkey , splits high-resolution images into multiple patches and then reuses the low-resolution encoder, which has gradually become the mainstream choice because it can support dynamic resolution or natural resolution chen2024far ; wang2024qwen2 ; wu2024deepseek ; bai2025qwen2 ; guo2025seed1 , making it suitable for processing varying visual information. MLLM works on projector mainly focuses on the effective mapping of visual information, which can be roughly divided into token-level and feature-level. The token-level projector converts the output features into tokens and concatenates them with text tokens before feeding them into the large language model. Some works use Q-Former, e.g., zhang2023video ; dai2023instructblip , but most works directly use MLP to bridge the modal gap due its simplicity and generalization, e.g., liu2023visual ; su2023pandagpt ; liu2024improved ; wang2024qwen2 ; bai2025qwen2 . The feature-level projector, e.g., alayrac2022flamingo ; zhang2023llama ; bai2023qwen ; wang2024cogvlm , inserts additional modules that enable in-depth interaction and fusion between text and visual features. For example, Flamingo alayrac2022flamingo inserts additional cross-attention layers into the frozen LLM layers to enhance language features using external visual cues.
Beyond architecture, scaling data is key to improve the performance of MLLMs, and numerous studies have focused on enhancing the data quality for MLLMs. For example, many MLLM studies focus on improving the data quality of instruction fine-tuning, e.g., dai2023instructblip ; zhang2023llama ; wang2023visionllm ; gao2023llama ; luo2023cheap ; chen2024internvl ; liu2024llavanext ; chen2024sharegpt4v ; guo2025seed1 . In addition, some studies, e.g., wang2022self ; liu2023visual ; zhu2023minigpt ; yang2023gpt4tools ; wang2023see ; chen2024allava , collect samples through self-instruction, where they use large language models to generate data that conform to text instructions based on a small number of manually annotated samples. Then, language data is usually mixed with multimodal data to train the model ye2023mplug ; luo2023cheap ; gao2023llama , improving the instruction-following ability. In addition, preference alignment is often used in scenarios where the model needs to align with specific human preferences. Reinforcement Learning from Human Feedback (RLHF) ouyang2022training and Direct Preference Optimization (DPO) rafailov2023direct are two main techniques for preference alignment.
2.2 GUI Agent
Early GUI agents rely on HTML or AXTree data to describe the screen state of user interactions through textual state descriptions. These methods depend on the structured representation of web page elements, enabling agents to locate targets based on tags, attributes, or text content. Raw HTML or AXTree data usually has redundant or noisy structures. Thus, some methods have studied the extraction of effective information from HTML. For example, Mind2Web deng2023mind2web uses a fine-tuned language model to sort web page elements and extract important ones. WebAgent gur2024real uses a dedicated HTML-T5 model to generate HTML fragments for specific tasks. AutoWebGLM lai2024autowebglm designs an algorithm to simplify HTML content. Some methods he2024webvoyager ; yang2023setofmark combine visual information. For instance, Set-of-Mark yang2023setofmark integrates visual and tree tag information for agent decision-making. However, the quality of HTML or AXTree data can limit the application of the aforementioned methods, considering different standards across platforms. In addition, structured data requires meticulous pre-processing, and in some cases, it also relies on additional heuristic methods or trained models to accurately locate and understand key GUI components.
With the rise of multimodal large language models (MLLMs), an increasing number of pure vision-based methods have been proposed. These methods leverage the strong visual capabilities of multimodal language large models and eliminate the need for manually designing data pre-processing for each task, so they have significant advantages over early GUI agents in terms of generalization. Early pure visual GUI agents focus on using MLLMs to process screenshots to understand GUI components, replacing HTML-based GUI component understanding. For example, CogAgent hong2024cogagent uses MLLMs to process high-resolution GUIs, achieving performance that surpasses HTML-based methods. Auto-GUI zhang2024you unifies GUI grounding into a text-driven grounding task and proposes action chains, using a series of historical actions to enhance agents; COAT zhang2024android further models the thinking process of "what action should be performed" to improve agents in complex tasks.
Furthermore, researchers have found that scaling data is key to enhancing the performance of GUI agents, thus proposing using synthetic data or video data. For instance, Aria-UI yang-etal-2025-aria proposes an extensible data synthesis pipeline for generating grounding data, which is used to train MLLMs specialized in GUI grounding. OS-Genesis yang-etal-2025-aria proposes a trajectory synthesis method to retrospectively generate tasks from the agent-environment interaction, rather than relying on manual supervision or predefined tasks. GUI-explorer xie2025gui automatically generates function-aware task objectives by analyzing GUI structure information, and achieves low-cost generalization of agents through unsupervised analysis of state transitions in the observation-action-result triples. OS-Atlas wuatlas releases the first multi-platform GUI data synthesis toolkit, supporting the automatic synthesis of cross-platform GUI grounding data while resolving action naming conflicts. GUI-Xplore sun2025gui enables GUI agents to learn from exploration videos.
Recently, researchers have aimed to make GUI agents more similar to human-like agents. For example, UGround gou2025uground executes actions only through human-like keyboard and mouse operations. Aguvis xu2025aguvis unifies different GUI action spaces and divides training into grounding and planning, enhancing the action ability after the agent has high GUI grounding performance. GUI-Odyssey lu2025gui presents a large-scale cross-application GUI agent training and evaluation dataset, allowing agents to interleavingly use multiple applications to execute tasks. UI-TARS qin2025ui can perform human-like interactions, which achieves accurate perception of GUI elements by collecting a large amount of screenshot data and enhances agent capabilities through various reasoning modes.
The efficiency of GUI agents is crucial for practical deployment, so some research efforts are also dedicated to this. For example, ShowUI lin2025showui constructs a UI correlation graph to identify redundant UI relationships and selects tokens based on the identification results to improve training efficiency. SimpleAgent chen2025less masks redundant elements in the current environment and uses consistency constraints to guide the cropping of historical tokens.
However, the aforementioned methods mainly focus on using Supervised Fine-Tuning (SFT) to fit manually annotated action trajectories for enhancing the performance of GUI Agents. In fact, numerous studies have shown that the single SFT method limits the model ability for autonomous exploration, which is crucial in GUI Agent tasks. Early works luo2025gui ; lu2025ui directly optimize models using a 0-1 reward similar to deepseek-R1 guo2025deepseek . For example, UI-R1 lu2025ui uses rule-based action rewards and the GRPO algorithm to enhance the performance of GUI agents in unknown scenarios. Further optimizations have focused on issues such as reasoning patterns liu2025infigui ; zhou2025gui , sparse rewards yuan2025enhancing ; tang2025gui , and online learning tang2025gui ; wei2025webagent . InfiGUI-R1 liu2025infigui proposes to first use trajectories with explicit reasoning steps for training, and then apply reinforcement learning to enhance the error correction ability. GUI-G1 zhou2025gui uses a fast-thinking template to encourage the model to generate answers directly, reducing excessive reasoning during training, and simultaneously designs a difficulty-aware RL objective to better learn hard samples. SE-GUI yuan2025enhancing calculates continuous rewards using the proximity between the predicted box and the GT box, replacing the 0-1 reward to alleviate the sparse reward problem. GUI-G2 tang2025gui transforms the discrete classification of GUI grounding into the continuous optimization of IoU through Gaussian point rewards, coverage rewards, and an adaptive variance mechanism, addressing the sparse reward. Zero-GUI yang2025zerogui proposes the ZeroGUI online learning framework, which automatically generates tasks and estimates rewards.
2.3 API Agent
Another type of agent, distinguished by its mode of interaction with computers or mobile phones, is the API-centric agent. These agents interact with external tools, functions, or services through pre-defined, well-structured programming interfaces. During interaction, relevant API information (e.g., function names) is included in the LLM prompt. The agent receives natural language requests from users and selects the most appropriate API to execute the task.
Microsoft Copilot is a typical example of an enterprise-level API agent. Through interfaces such as the Microsoft 365 Copilot API, it allows developers to integrate capabilities like data analysis and document generation into custom applications. Related research focuses on issues such as automated generation/updating of tools gao2024clova ; wang2024trove ; trivedi2024appworld ; wang2024llms ; wang2024llms , simplification of tools yuan2025easytool , and patterns of tool usage du2024anytool ; lin2025robust . CLOVA gao2024clova identifies tools that need updating by analyzing human feedback, automatically collects training data, and uses prompt tuning to update the tools. TroVE wang2024trove constructs a verifiable and efficient function toolbox through generating, using, expanding, and periodically streamlining the toolbox. Appworld trivedi2024appworld builds a high-quality execution environment for agents and creates a set of autonomous agent task sets that require agents to generate cross-application interaction code for processing. STE wang2024llms utilizes large models to simulate reasonable environments for tool usage, then enables large models to interact with tools and learn from environment feedback. Kimi K2 team2025kimi proposes a trajectory synthesis scheme for function calls, relying on a vast tool specification library constructed from real-world tools and synthesized tools. Then, EasyTool yuan2025easytool converts tool documents into unified and concise instructions to improve tool usage efficiency. Anytool du2024anytool retrieves APIs to handle user needs and proposes a self-reflection mechanism. Hammer 2.1 lin2025robust improves the model sensitivity to irrelevant functions through enhanced datasets and function masking technology. API agents rely on text-based API calls, which are generally highly reliable, and can complete complex tasks with a single call. However, they are limited to pre-defined APIs, have low transparency, and lack the generalization of human-like actions exhibited by GUI agents.
3 UItron
UItron is an open-source foundational GUI agent framework designed to advance automated interaction and reasoning across both mobile and PC environments. The system is built upon two key pillars: a robust data engineering pipeline tailored for GUI agent training, and a unified interactive infrastructure that supports scalable data collection and dynamic training. Leveraging these foundations, UItron delivers core capabilities in perception, grounding, and planning, enabling agents to understand complex interfaces, accurately localize tasks, and execute effective action sequences in diverse real-world scenarios.
3.1 Problem Formulation
GUI Agent aims to predict the next action in the -th step based on the task instruction, historical actions and visual environment observation (a GUI image). The action is usually represented by the action type and parameters, such as the click(box) and input(content). Formally speaking, we denote the task instruction as , the historical action as , and the visual environment observation as . Therefore, the task goal of GUI agent in the -th step can be formulated as:
[TABLE]
here represents the GUI agent with trainable paratemers .
Note that previous work in this area usually utilizes multiple historical images to augment the input with historical information. It significantly increases the length of the input sequence and the computational cost, which is detrimental to the redundant nature of visual information. In fact, we empirically found that omitting historical images does not result in a significant performance degradation in most benchmarks, as historical action information provides sufficient gains. Therefore, to mitigate computational cost, we did not use historical images in this version.
3.2 Data Engineering
As shown in Figure 2, we explore systematic data engineering to improve UItron, including perception data, planning data, and distillation data. Besides, we also organize a small amount of general multimodal data that is beneficial to GUI agent, as well as high-quality manual annotation data for Chinese scenarios.
3.2.1 Perception Data
**Multi-turn Conversation. ** In practical applications, a single complex screenshot can contain hundreds of UI elements, and open-source grounding datasets typically feature multiple objects within one image. To minimize redundant image loading and decrease training costs, we consolidated various instruction/description-answer pairs associated with the same screenshot into unified multi-turn conversations, thereby constructing multi-turn training samples. Utilizing such multi-turn data for training not only lowers computational overhead but also improves the model’s ability to comprehend and distinguish between different elements within a UI scene.
Multi-task Unification. To enhance the basic understanding ability in GUI scenarios, we collect a large amount of UI-related perception data instead of just considering the traditional agentic trajectory data. We collect a wealth of image-text multimodal pairs from a wide range of PC/mobile application screenshots, covering tasks in GUI scenarios such as OCR, VQA, and Caption. We then integrate these UI-related perception data and traditional agentic trajectory data into the unified format to support training.
Cross-platform Generalization. Although a substantial amount of grounding data already exists in the GUI agents field, but data collected from different platforms and devices often lacks generalizability, and various tasks employ distinct and isolated data synthesis criteria, making it challenging for these datasets to complement one another. To address the generalization challenge in GUI grounding, we integrated data from diverse sources and synthesis methodologies within the GUI agent domain. By unifying open-source datasets (including Uground gou2024navigating , Aria-UI yang2024aria , Aguvis xu2024aguvis and OS-Atlaswu2024atlas ), our approach systematically explores whether diverse synthesis criteria can complement one another, thereby enhancing the generalization capability of agent localization across various scenarios.
3.2.2 Planning Data
L1-L2-L3 Inference. In addition to the final output action, the execution of a planning task can be enhanced by incorporating multiple levels of perception and reasoning to facilitate action prediction. Following xu2025aguvis , we divide the planning data into several elements including screen observation, reasoning (thinking), action and summarization. The L-1 inference involves only action prediction and summarization, L-2 inference further introduces reasoning, and L-3 inference incorporate screen context to observe and analyze changes in the UI interface. This multi-layered and fine-grained perception strategy enables the model to better adapt to tasks of varying complexity and diverse scenarios. To balance efficiency and accuracy, we utilize L2-level descriptions as historical context prompts for action prediction during inference.
Back-tracking. The planning process of a GUI agent can be naturally formulated as a partially observable Markov decision process, in which the model predicts the next action based on historical actions and the current state. However, this approach neglects the model’s capacity for reflection and backtracking on previous decisions. Specifically, while the model is aware of its current state, it lacks insight into the sequence of actions that led to that state. Consequently, the model struggles to establish connections between past, present, and future states, which hinders its ability to generate consistent and coherent action predictions. Following huang2025scaletrack , we enhance the interaction between GUI agents and their environment by introducing backtracking. Specifically, at each time stept, agent not only predicts the next action based on the current overall goal, but also infers the sequence of historical actions that resulted in the present state.
Thinking format. To more precisely distinguish the reasoning process from action prediction during inference optimization, and to facilitate seamless integration with function calls, we employ explicit separators to demarcate different sections of the model output. Specifically, the model’s output is structured in the following format:
<observation> Observation </observation>
<think> Thought </think>
<tool_call> Actions </tool_call>
<conclusion> Conclusion </conclusion>
3.2.3 Distillation Data
Manual labeling of long trajectory data in real scenarios is costly. Therefore, we construct a fully automated trajectory collection process, which includes three stages: (1) Automated task generation guided by real tasks, (2) Automated task execution in simulation environment, and (3) Trajectory result judgment based on VLM voting. After data cleaning and splitting, we finally obtain 500k single-step trajectory data for training.
Task Generation. Directly generating tasks with a VLM, without contextual awareness of specific scenarios, often results in unclear or unexecutable tasks. To mitigate this issue, we utilized the initial states of 369 existing tasks in Osworld as prompts to generate additional, related yet distinct tasks using the GPT-4o extension. Furthermore, to prevent task misalignment caused by varying initial states, each generated task is paired with its corresponding initial state.
Trajectory Distillation. Building on the multi-domain tasks generated by the VLM, we integrated state-of-the-art GUI agent models into the Osworld simulation environment and implemented a concurrent trajectory distillation pipeline. For each task, the model is allowed up to attempts, with the reasoning process and specific actions of each step recorded. The complete execution trajectory and task details are then evaluated by a VLM to determine whether the task was successfully completed. Additionally, we tracked the number of attempts for each task: data that succeeded in a single attempt were utilized for supervised fine-tuning (SFT), while data requiring multiple attempts were identified as challenging cases and used for GRPO training.
Voting Samples. As illustrated in Figure 2, the trajectory result discriminator is designed according to the following key principles: (1) Visual-Centric Evaluation: GUI agents primarily rely on interactions with the graphical interface. Therefore, changes in the GUI interface serve as the primary indicators of task execution status. (2) Voting Mechanism: Both in supervised fine-tuning (SFT) and GRPO training, even minor variations in training data quality can lead to fluctuations in model prediction accuracy. To ensure robustness, we adopt a stringent voting consensus mechanism, wherein each trajectory is sampled and evaluated multiple times. A trajectory is assigned a positive label only if all evaluations unanimously indicate success. (3) Difficulty Classification: Leveraging the multi-sampling strategy, each task is inferred multiple times by the model. The sample difficulty is then graded based on the number of successful executions across these inferences, enabling targeted application of samples to different training stages.
3.2.4 General Multimodal Data
The general multimodal data serves as a rich repository of fundamental and universal knowledge, intimately interwoven with GUI-related datasets. Recognizing this intrinsic connection, we augment our training regime with image-text pair data sourced from diverse task scenarios such as Optical Character Recognition (OCR), Visual Question Answering (VQA), and Image Captioning. This incorporation aims to bolster the GUI agents’ capability to seamlessly comprehend visual content and accurately interpret directive objectives across varied contexts, ultimately fostering a more holistic understanding of task execution dynamics. By leveraging this diverse array of multimodal inputs, we strive to enrich the GUI agents’ adaptability and cognitive depth, equipping them to meet increasingly complex interaction demands.
3.2.5 Manual Annotation.
A comprehensive and representative training dataset is essential for developing a robust GUI agent. However, most existing datasets are predominantly focused on English-language applications, leaving a significant gap in coverage for Chinese apps and interfaces. This imbalance limits the agent’s ability to generalize and perform effectively in Chinese application scenarios. To address this critical shortcoming, we assembled a dedicated team to manually collect operation trajectories specifically targeting top-tier Chinese mobile applications. Our annotation efforts focused on capturing diverse tasks, complex user interactions, and a wide variety of interface designs unique to the Chinese app ecosystem. Through this targeted data collection, we substantially broadened the scope and diversity of our training data, ensuring that UItron is equipped to excel in both English and Chinese application environments.
3.3 Interactive Infrastructure
To facilitate trajectory data collection, online evaluation and RL training, we build an interactive environment connecting both Mobile and PC devices, as shown in Figure 3. Specifically, its significance comes from the following three aspects. First, the Mobile and PC interactive environment provides an automated function for recording screenshots and coordinates, which significantly simplifies the difficulty of manually annotating trajectory data and thus accelerates our efficiency in collecting trajectories for Chinese scenarios. Then, the Mobile and PC interactive environment provides a more realistic evaluation environment, simulating the real interaction process between GUI agents and humans. Finally, the Mobile and PC interactive environment provides the execution results of each action output that facilitate online reinforcement learning for the entire trajectory.
Mobile Infra. We build an Android-based cloud real-device environment, which connects multiple genuine Android devices via a server, allowing users to remotely control these smartphones through a web browser. The system is composed of three key components:
- •
Scrcpy: Responsible for streaming the smartphone’s screen content to the browser in real time, similar to live streaming.
- •
Phone-server: Converts user interactions like clicks and swipes made in the browser into touch commands that the smartphone can understand.
- •
Device-agent: Serves as the device management center, integrating the functionalities of the previous two components and providing HTTP interfaces for application installation and device information retrieval.
The architecture follows an Agent/Server model, with the server side handling the user interface and device scheduling, while the Agent side manages the specific smartphone devices. Real-time communication is facilitated via WebSocket, and MySQL is used to store device and user data. This solution addresses prevalent issues in mobile application testing, such as insufficient device availability, incomplete model coverage, and the need for remote operations.
PC Infra. We utilize the open-source OSWorld environment OSWorld , a scalable real computer setting specifically designed for developing multimodal agents capable of executing a wide array of real computer tasks beyond isolated interfaces and applications. This executable environment allows unrestricted keyboard and mouse control over real computer applications, supporting initial task state configuration, execution-based assessment, and interactive learning across major operating systems like Ubuntu, Windows, and macOS. Moreover, it provides the capability to evaluate open-ended computer tasks, encompassing activities from image viewing and software feature integration to programming. Hence, OSWorld serves as a unified real computing environment where users can define their agent tasks without the need to construct simulation environments tailored to specific applications or domains.
3.4 Training Paradigm
During training, we employ a three-stage training strategy (as shown in Figure 4), in which consists of two SFT stages for perception and planning tasks, as well as a RL stage with curriculum reinforcement learning framework. In the first stage, the perception task focuses on improving the basic understanding ability of the vision-language model in GUI scenarios, such as grounding, captioning, VQA, and OCR. In the second stage, the planning task concentrates on predicting the next action based on historical actions. In the final RL stage, the curriculum reinforcement learning framework aims to improve reasoning and exploration capacity via group relative policy optimization algorithm on trajectory data.
3.4.1 Stage 1: Perception Task
The perceptual abilities of a GUI agent are fundamental for enabling deep understanding and effective interaction with digital interfaces. Modern digital environments are increasingly complex, with user interfaces containing rich visual elements, diverse layouts, and embedded semantic information. Without robust perception capabilities, an agent would struggle to interpret the structure, content, and intent behind various UI components, thereby limiting its effectiveness in real-world applications.
To address this critical challenge, we initially enhance UItron’s perceptual ability in the first stage by fine-tuning it on a wide range of GUI perception scenarios. The goal of this fine-tuning is to systematically strengthen UItron’s ability to recognize and interpret interface elements, ensuring a deeper and more precise understanding of digital UIs. In particular, we focus on four core perception tasks: grounding, captioning, VQA, and OCR. Grounding enables the agent to accurately localize and associate semantic labels with interface components, establishing a clear mapping between visual regions and their meanings. Captioning facilitates the generation of natural language descriptions for UI layouts and elements, allowing the agent to summarize and communicate interface structures effectively. Furthermore, VQA empowers the agent to answer queries about the interface by integrating both visual and semantic cues, supporting interactive and context-aware understanding. OCR extracts embedded textual information from the interface, ensuring that no detail is overlooked and that all relevant data is accessible for downstream reasoning.
Through mastering these perception tasks, the fine-tuned UItron attains a holistic and nuanced understanding of user interfaces. This comprehensive perceptual foundation not only enhances its ability to interpret complex digital environments, but also lays a solid groundwork for advanced reasoning, planning, and autonomous interaction in subsequent stages.
3.4.2 Stage 2: Planning Task
In this stage, training with planning tasks aims to output the predicted actions defined in Equation (1), then optimize the output using a generative loss in an auto-regressive manner. Effective planning is a key capability that enables a GUI agent to execute purposeful actions and navigate complex digital environments.
The centra idea of planning task is to capture next actions for forward planning and historical actions for backtracking. To this end, we construct two types of training data in the planning task, one for forward planning and the other for backward backtracking. The forward planning follows the message organization approach defined in Equation (1), which inputs historical actions and the environment observation to output next action . In contrast, the backward backtracking follows the message organization approach defined as follows:
[TABLE]
Here the main difference is that the previous action is not provided in the input, while the agent need to predict .
3.4.3 Stage 3: Curriculum Reinforcement Learning
To enhance the reasoning ability, UItron develops a curriculum reinforcement learning framework for performing group relative policy optimization (GRPO) shao2024deepseekmath algorithm on trajectory data. It first computes dense rewards from the action steps in the offline environment (simple), and then computes the task-level reward for the trajectory data in the online environment (complex).
GRPO
We adapt the Group Relative Policy Optimization (GRPO) shao2024deepseekmath algorithm for RL. For each input , the policy samples a group of candidate responses .
[TABLE]
where and are hyperparameters, and , , and are the model after SFT, the optimized model and the old policy model. The group-normalized advantage for the -th response is:
[TABLE]
Offline RL Since successful trajectories of GUI agents in online environments are usually rare, the offline RL collects rewards for each action step in the offline environment to avoid sparse rewards. For each action prediction, it generates multiple candidate actions for the same input and calculates the GRPO loss to improve reasoning and exploration capabilities.
Online RL The online RL is built on the interactive infrastructure shown in Figure 3. It collects rewards for the entire trajectory through the rollout dialogues in an online environment. For each task, online RL allows the model to freely explore all possible action plans until it reaches the maximum number of steps or generates an end signal. To this end, online RL utilizes the advanced vision-language model as a scoring model to evaluate whether the task is completed based on the entire trajectory and task, and outputs a reward signal of 0 or 1. To improve the credibility of rewards in the RL process, we incorporate multiple scoring models from different vision-language models. Besides, we also strictly filter the trajectories that are predicted correctly by multiple scoring models simultaneously. Finally, the online RL calculates the trajectory-level GRPO loss based on multiple sampled trajectories, thereby improving the exploration ability in the online environment.
Summary Finally, we produce two versions named UItron and UItron-RL, both of which are based on the Qwen25-VL model structure, but with different parameter weights. The former is obtained after training in stages 1 and 2, while the latter is obtained after reinforcement training in stage 3. In the experiments, we report the results of uitron-RL in all online environments, and report the results of UItron in other offline scenarios.
4 Experiments
We carry out extensive experiments covering scenarios including GUI perception, grounding, offline planning, and online planning. In particular, we also built our own Chinese scenario evaluation and conduct experiments to explore the improvement of Chinese capabilities.
4.1 Evaluation of GUI Perception
VisualWebBench.
We evaluate our model’s screen perception capabilities on VisualWebBench liu2024visualwebbench , a comprehensive benchmark containing multiple website-based tasks. For the Grounding Tasks, we measure prediction accuracy by requiring the agent to select correct answers from set of masks (SoM) on screenshots.
Complicated Perceptual Benchmarks.
To evaluate the model’s ability to comprehend abstract instructions, we follow liu2024harnessing by assessing visual grounding of natural language-described elements through RefExp bai2021uibert and testing the reverse task of element captioning on WidgetCap li2020widget . We additionally evaluate on general Web QA task of WebSRC chen2021websrc , requiring textual and structural understanding of GUI elements, for further assessing comprehensive perceptual capabilities.
Baseline Models.
We compare our UItron with SOTA models in both understanding and GUI operation task. Among general VLLMs, we use GPT-4o hurst2024gpt and Qwen2.5-VL bai2025qwen2 as our baseline for their powerful understanding capabilities in general task understanding; among GUI-related VLLMs, we compare our UItron with MultiUI liu2024harnessing and UI-TARS qin2025ui , the former is specialized in GUI understanding while the later one is the SOTA model in GUI tasks.
As shown in Tables 3 and 4, our UItron demonstrates superior performance on perceptual tasks, establishing crucial groundwork for subsequent planning and reasoning capabilities essential for GUI task execution. This effectiveness stems from the limited understanding data employed in both training stages 1 and 2, which not only mitigates the spurious forgetting issue zheng2025spurious that degrades baseline VLLM’s original comprehension, but also enhances GUI-specific understanding. This results indicate that maintaining the generalist model’s capabilities relevant to the the downstream task while developing specialized skills is critical for creating effective specialist agents. Details of the benchamrks are listed in Table 1.
4.2 Evaluation of GUI Grounding
ScreenSpot.
We use ScreenSpot cheng2024seeclick to assess the fundamental GUI-understanding and element-grounding accuracy of GUI-agent models. The ScreenSpot benchmark comprises more than 600 screenshots and 1,200 instructions, spanning multiple platforms—iOS, Android, macOS, Windows, and web pages. We report separate results for Text and Icon/Widget elements on the Mobile, Desktop, and Web splits of ScreenSpot, together with the micro accuracy aggregated across all platforms.
ScreenSpot-V2.
Similar with ScreenSpot cheng2024seeclick , we also employ ScreenSpot-V2 wuatlas for evaluation, which is a GUI benchmark that advances from basic recognition to cross-modal reasoning. This enhanced version better reflects real-world complexity through optimized annotations, expanded task types, and improved data diversity. The benchmark contains 1,272 instructional samples paired with 756 images, drawing from data sources similar to ScreenSpot.
The experimental results in Table 6 demonstrate that UItron exhibits impressive leading GUI grounding performance across all platforms. This advantage is primarily attributed to UItron’s adoption of data engineering specifically tailored for GUI agents, which provides high-quality and well-defined datasets for model training. Furthermore, the parameter scaling experiments of UItron indicate that, with sufficient and high-confidence training data, the model’s grounding capability is further enhanced as its scale increases. Compared with state-of-the-art model (i.e., UI-TARS) that additionally utilize internal data, UItron-72B relies solely on open-source data, achieves a 2.1% improvement in micro grounding accuracy.
4.3 Evaluation of Offline Planning
AndroidControl.
AndroidControl li2024effects is a benchmark for evaluating the planning and action-execution capabilities of GUI agents on Android devices. It contains 15,283 episodes of everyday tasks across 833 distinct applications, making it the most diverse UI-control dataset to date. Following standard practice, we report results under two settings. AndroidControl-Low: At every step the agent receives a screenshot together with a natural-language description of the required action and must predict both the action type and its exact parameters. AndroidControl-High: Only the high-level task goal and the current screenshot are provided at each step. The agent must autonomously plan the entire procedure and output the correct action together with its parameters. Following OS-Atlas wuatlas , we reserve 1,000 episodes as an out-of-domain evaluation set and report the action-type accuracy, grounding accuracy, and average step success rate.
General-purpose LLMs such as GPT-4o and Claude demonstrate reasonable action-type accuracy in the Low setting, but their grounding accuracy is essentially zero and their step success rates are very low, indicating a lack of fine-grained perception and UI localization capability. In contrast, specialized GUI models like SeeClick, Aria-UI, OS-Atlas, AGUVIS, and UI-TARS exhibit a clear advantage, achieving substantially higher scores across all metrics. This demonstrates the superiority of dedicated GUI models in both accurately identifying UI elements and executing precise actions. Notably, our model, UItron, surpasses all others: UItron-72B achieves the highest grounding and step success rates in both Low and High settings, showcasing exceptional performance not only in guided UI action execution but also in autonomous planning. This underscores the critical importance of our model’s unified approach to perception, grounding, and planning, enabling robust and generalizable UI control.
GUI-Odyssey.
GUI-Odyssey lu2024gui is used for evaluating cross-app navigation agents, surpassing the limitation of other benchmarks that are restricted to a single app. It consists of 7,735 episodes, six types of cross-app tasks, 201 apps, and 1.4k app combinations. GUI-Odyssey-Random/Task/Device/App are four different test subsets, with statistics shown in Table 2. It aims to assess the generalization ability of autonomous GUI agents across different applications, tasks, and device setups. Following OS-Atlas wuatlas , we report the macro average performance across these subsets.
The challenge of cross-app navigation exposes even greater limitations in general-purpose LLMs, with GPT-4o and Claude displaying poor performance in both grounding and step success rate, and action-type accuracy dropping further compared to single-app scenarios. Specialized GUI models again demonstrate their superiority, with SeeClick, Aria-UI, and OS-Atlas showing solid results, and UI-TARS achieving state-of-the-art performance. Notably, UItron remains highly competitive, achieving results close to those of UI-TARS in most metrics. While UItron may perform slightly below UI-TARS on certain cross-app tasks, it consistently demonstrates top-tier results when considering both AndroidControl and GUI-Odyssey benchmarks together, highlighting its overall superiority in comprehensive UI understanding and control. UItron’s strong performance across diverse tasks and app combinations underscores its robust generalization ability and reliability, making it one of the most effective agents for complex, real-world UI navigation tasks.
4.4 Evaluation of Online Planning
OSWorld.
We use OSWorld OSWorld to evaluate the performance of GUI agent models as online agents on personal computer (PC) platforms. OSWorld is a real computer environment that supports multimodal agents in task setup and execution evaluation across multiple operating systems. It includes a benchmark of 369 tasks covering real-world web and desktop applications, OS file I/O, and workflows across applications.
Baselines.
We compare our method with two types of agentic methods, namely GUI agent and computer-use agent. The GUI agent is a typical method that considers both Mobile and PC scenarios, while the compute-use agents are some recent methods that are specifically designed for PC scenario. For GUI agent, we select several advanced baselines including Augvis-72B xu2025aguvis , UI-TARS-72B qin2025ui , UI-TARS-1.5-7B qin2025ui (72B version is closed source). We also compare with Qwen2.5-VL-72B bai2025qwen2 to demonstrate the improvement gains via several training stages. For compute-use agent, we select several advanced baselines including OpenAI CUA OpenAI , Claude CUA Anthropic and OpenCUA wang2025opencua . All methods adopt the same setting of maximum length of 15 steps for fair comparison.
Results.
Table 8 reports the comparative results of UItron and other baseline methods. From the results, we observe that specialized CUA agents generally outperform GUI agents, primarily due to their more singular scenarios and objectives. We can also see that Uitron achieves competitive performance in GUI agents, with only a small gap compared to the state-of-the-art UI-Tars-1.5 method. In addition, the experimental results also show that existing vision-language models such as Qwen25-VL suffers from poor performance, which can be greatly improved through a large amount of targeted training in GUI scenarios.
4.5 Evaluation of Chinese Scenario
Evaluation Data
We evaluate the effectiveness of our method in both offline and online environments. To support comprehensive evaluation, we constructed test data and an Android cloud environment. We manually annotate 545 trajectory steps from 109 universal tasks across several apps, and verify that these test tasks did not overlap with the training tasks. Considering that some tasks in the online environment have some app automatic login risks and failures, we retain 86 tasks that can be completed in the online environment.
Evaluation Metrics
We design different evaluation metrics for GUI agents in offline and online environments. For evaluation in offline environment, in which each predicted action corresponding to a ground-truth action, we directly calculate accuracy to evaluate the single-step success rate (i.e., Step SR) and task success rate (i.e., Task SR). A task is deemed successful when all execution steps exactly align with the ground-truth action sequence. For evaluation in online environment, the GUI agent freely explores all possible actions according to finish the task without any ground-truth action. Therefore, we evaluate whether the task is completed accurately based on the complete execution trajectory. We leverage an advanced visual-language model (i.e., GPT-4o hurst2024gpt ) to determine whether the task is completed. The result is 1 for completion and 0 for incomplete.
Offline Results
Table 9 reports the Step SR and Task SR results of UItron and baseline methods. From the results, we can see that both UItron-7B and UItron-72B significantly outperform the baseline methods in all evaluation metrics, demonstrating their superiority in Chinese scenarios. Interestingly, for the Step SR and Task SR indicators, the results indicate that they have a positive correlation, but the difference in Task SR is significantly larger, which is probably because Task SR reflects the more rigorous accumulation of Step SR. Therefore, we can see from the results that different methods are relatively close in terms of Step SR, but have significant differences in terms of Task SR. The advanced performance of UItron primarily stems from learning page organization and interaction logic through extensive data from Chinese scenarios, which exhibit significant differences compared to traditional English contexts.
Online Results
Table 10 reports the Task SR results of UItron and baseline methods. The results indicate that UItron outperforms the baseline model with a significant performance advantage, verifying its better interaction and exploration capabilities in online environments. Another noteworthy phenomenon is that the online evaluation results of the same model consistently surpass its offline evaluation results, a trend often overlooked in previous research due to the lack of comparable offline and online tasks. The explanation for this phenomenon lies in the nature of the online environment, which offers GUI agents ample space to explore and recover from errors with relaxed constraints. In this setting, certain erroneous steps can be rectified by returning to the original step. Conversely, in offline evaluations, any failed step inevitably results in task failure.
5 Conclusion and Future Work
This paper presents UItron, a pioneering open-source foundational model designed to enhance the capabilities of GUI agents in executing complex tasks across digital environments such as PCs and Mobile devices. UItron conduct sufficient investigation of data engineering and interactive infrastructure to handle the scarcity of annotated trajectory data. It systematically compares various data strategies to improve the training effectiveness. UItron adopts a typical training paradigm of GUI grounding and planning, and then develops a curriculum reinforcement learning method that improves complex reasoning and exploration in the online environment. In particular, UItron also emphasizes the importance of Chinese interaction capabilities in practical GUI agent deployment. Through comprehensive annotation of over one million action steps from leading Chinese apps, UItron achieves superior results in realistic offline and online evaluation frameworks, bringing GUI agents closer to practical deployment. Experimental results demonstrate that UItron achieves superior performance in benchmarks of GUI perception, task localization and planning, as well as a significant advance in Chinese application scenarios.
In summary, UItron offers an open-source foundation that facilitates the future development of GUI agents. In our future work, we will investigate the intrinsic thinking patterns underlying the interpretive action behaviors of GUI agents, as we have observed frequent ambiguities and inconsistencies between thinking and action outputs in our method. Furthermore, we will study multi-agent collaboration strategies to systematically explore capabilities such as reflection, backtrace, and correction, considering that current single-agent methods often struggle to simultaneously handle both visual and textual aspects. Lastly, we plan to build the unified agentic infrastructure and reinforcement learning environment that integrate capabilities such as coding, tool-use and function-call, spanning both the 2D digital world and the 3D physical realm.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. ar Xiv preprint ar Xiv:2303.08774 , 2023.
- 2[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems , 35:23716–23736, 2022.
- 3[3] Anthropic. Introducing claude 4.5. URL https://www.anthropic.com/news/claude-4 , 2025.
- 4[4] Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, et al. Uibert: Learning generic multimodal representations for ui understanding. ar Xiv preprint ar Xiv:2107.13731 , 2021.
- 5[5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ar Xiv preprint ar Xiv:2308.12966 , 1(2):3, 2023.
- 6[6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen 2. 5-vl technical report. ar Xiv preprint ar Xiv:2502.13923 , 2025.
- 7[7] Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering gui agent with context-aware simplification. In International Conference on Computer Vision , 2025.
- 8[8] Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt 4v-synthesized data for lite vision-language models. ar Xiv preprint ar Xiv:2402.11684 , 2024.
