CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

Zeyi Sun; Yuhang Cao; Jianze Liang; Qiushi Sun; Ziyu Liu; Zhixiong Zhang; Yuhang Zang; Xiaoyi Dong; Kai Chen; Dahua Lin; Jiaqi Wang

arXiv:2508.20096·cs.CV·August 28, 2025

TL;DR

CODA is a trainable compositional framework that combines a generalist planner and a specialist executor, enabling effective long-horizon planning and precise execution in scientific GUI tasks, with improved generalization and performance.

Contribution

Introduces CODA, a novel two-stage training pipeline for a dual-brain agent integrating a generalist planner and a specialist executor, addressing data scarcity and adaptability in scientific domains.

Findings

01

Outperforms baselines on ScienceBoard benchmark

02

Achieves state-of-the-art results among open-source models

03

Demonstrates effective cross-domain generalization

Abstract

Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the…

Tables

Table 1: Success rates (↑) of various models on ScienceBoard (Sun et al., 2025a). Methods based on proprietary models and on open-source models are labeled Proprietary and Open-source, respectively (purple and green backgrounds in the original layout). * indicates specialist agents trained separately for each software, with ensembled results.

Metric	Category	Model	Algebra	Biochem	GIS	Astron	Overall
Average@1	Proprietary	GPT-4o (OpenAI, 2023)	3.23%	0.00%	0.00%	0.00%	0.81%
Average@1	Proprietary	Claude-3.7-Sonnet (Anthropic, 2025)	9.67%	37.93%	2.94%	6.06%	14.15%
Average@1	Proprietary	Gemini-2.0-Flash (Team et al., 2023)	6.45%	3.45%	2.94%	6.06%	4.73%
Average@1	Proprietary	GPT-4o → UGround-V1-7B (Gou et al., 2024)	0.00%	3.45%	0.00%	3.03%	1.62%
Average@1	Proprietary	GPT-4o → OS-Atlas-Pro-7B (Wu et al., 2024b)	6.25%	10.34%	0.00%	3.03%	4.92%
Average@1	Proprietary	GPT-4o → UI-TARS-72B (Qin et al., 2025)	3.23%	10.34%	5.88%	6.06%	6.38%
Average@1	Open-source	Qwen2.5-VL-72B (Bai et al., 2025)	22.58%	27.59%	5.88%	9.09%	12.94%
Average@1	Open-source	InternVL3-78B (Zhu et al., 2025)	6.45%	3.45%	0.00%	0.00%	2.69%
Average@1	Open-source	UI-TARS-1.5-7B (Qin et al., 2025)	12.90%	13.79%	0.00%	6.06%	8.19%
Average@8	Open-source	Qwen2.5-VL-32B (Bai et al., 2025)	10.48%	13.79%	1.47%	4.55%	7.57%
Average@8	Open-source	UI-TARS-1.5-7B (Qin et al., 2025)	6.49%	10.24%	0.80%	3.03%	5.14%
Average@8	Open-source	CODA (Stage-1)*	13.71%	26.29%	7.72%	9.85%	14.39%
Average@8	Open-source	CODA (Stage-2)	20.16%	32.23%	14.71%	17.05%	21.04%
Pass@8	Open-source	Qwen2.5-VL-32B (Bai et al., 2025)	29.03%	31.03%	8.82%	9.09%	19.49%
Pass@8	Open-source	UI-TARS-1.5-7B (Qin et al., 2025)	19.35%	24.14%	5.88%	12.12%	15.36%
Pass@8	Open-source	CODA (Stage-1)*	41.94%	44.83%	23.53%	18.18%	32.12%
Pass@8	Open-source	CODA (Stage-2)	48.39%	51.72%	29.41%	30.30%	39.96%

Table 2: Evaluation of different judge methods on AgentRewardBench (Lù et al., 2025) and ScienceBoard (Sun et al., 2025a).

Method	AgentRewardBench Precision	AgentRewardBench Recall	ScienceBoard Precision	ScienceBoard Recall
Qwen2.5-VL-72B-single	64.5	83.4	41.5	80.1
72B-GUI-Judge	73.5	79.0	43.7	80.1
72B-voting@4	76.1	79.5	58.6	75.3
72B-voting@4 w/ multi-res	78.9	77.4	65.7	77.9
72B-voting@4 Ensemble	81.2	76.8	69.5	74.2

Equations

$$a_{t}=\pi\big(g,(o_{1},a_{1},\dots,a_{t-1},o_{t})\big)$$

$$p_{t}=\text{Planner}(m_{t-1},o_{t-1},o_{t})$$

$$a_{t}=\text{Executor}(m_{t-1},o_{t-1},o_{t},p_{t})$$

$$r^{(i)}=r(a^{(i)},a_{T})=\mathbb{I}\big(\text{type}(a^{(i)})=\text{type}(a_{T})\big)+r_{\text{dist}}(a^{(i)},a_{T})$$

$$A^{(i)}=\frac{r^{(i)}-\text{mean}(\{r^{(j)}\}_{j=1}^{G})}{\text{std}(\{r^{(j)}\}_{j=1}^{G})},\quad i=1,\dots,G.$$

$$\mathcal{L}_{\text{GRPO}}(\pi_{\theta})=\dots\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|p^{(i)}|}\dots$$

where $r^{i,t}(\theta)=\frac{\pi_{\theta}(p^{(i)}\mid s,I)}{\pi_{\theta_{\text{ref}}}(p^{(i)}\mid s,I)}$ and $D_{\text{KL}}^{i,t}(\pi_{\theta},\pi_{\text{ref}})=\frac{\pi_{\text{ref}}(p^{(i)}\mid s,I)}{\pi_{\theta}(p^{(i)}\mid s,I)}-1-\log\frac{\pi_{\text{ref}}(p^{(i)}\mid s,I)}{\pi_{\theta}(p^{(i)}\mid s,I)}$.

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Generalization Across Software: By leveraging a specialist-to-generalist approach, CODA achieves strong generalization across novel software environments. Its ability to adapt to different software systems without requiring human-labeled data is a significant improvement over many existing systems.

Weaknesses

Ambitious Design: The idea of decoupling high-level planning from low-level execution is an interesting attempt to mimic human cognition. The use of a "Cerebrum" and "Cerebellum" model, while conceptually engaging, feels overly complex for the problem at hand. It’s as though the authors were trying too hard to sound cerebral, when a simpler solution might suffice. Potential Overfitting: While the system is trained to generalize, there is a risk of overfitting during the specialization phase, es

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is generally well-presented and easy to follow. The analogy to the human brain's cerebrum-cerebellum division provides an intuitive conceptual framework that helps readers understand the motivation for decoupling planning from execution. The experimental evaluation demonstrates that the presented framework performs strongly better than closed larger models. The fact that the model trained on ScienceBoard shows meaningful performance on the unseen OSWorld benchmark is a good sign as we

Weaknesses

While CODA works better than closed models, it would have been good to include other agentic frameworks for comparison. Also, Table 1 needs to be explained better: it is not very clear what Average@1, 8, ... stand for. Minor point: the LVLM acronym is introduced before it is explained (first paragraph of Sec. 2).

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. While the formalism of decoupled planner-executor design for agentic framework is well established, this work still provides good insights into further fine-tuning the planner with frozen executor can yield significant benefits in domains like GUI agents, where the low-level grounding can be performed with high accuracy but high-level planning is still challenging with the lack of domain knowledge. 2. The proposed automated judge/reward system largely alleviate the need for human labels, furt

Weaknesses

1. The authors state that in stage 1 training they train four specialist models, one for each software in ScienceBoard. However, as far as I understand, ScienceBoard contains tasks across 6 domains with one software per domain, so how can four specialized agents cover six software applications? Moreover, as mentioned in ScienceBoard, there are cross-application scenarios that require more than one software to accomplish the tasks. How do these software-specialized models handle cross-application tasks?

Code & Models

Models

🤗
OpenIXCLab/CODA-PLANNER-TARS-32B
model· 8 dl· ♡ 3

Full text

CODA: COordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning.

Zeyi Sun∗1,2, Yuhang Cao∗2, Jianze Liang∗2, Qiushi Sun∗4, Ziyu Liu∗1,2, Zhixiong Zhang1,2,

Yuhang Zang†2, Xiaoyi Dong2,3, Kai Chen2, Dahua Lin2,3, Jiaqi Wang†2

1Shanghai Jiao Tong University 2Shanghai AI Laboratory

3The Chinese University of Hong Kong 4The University of Hong Kong

Abstract

Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains like scientific computing, which require both long-horizon planning and precise, fine-grained execution. Existing approaches suffer from a trade-off: generalist agents excel at planning but falter in execution, while specialized agents show the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a “planner” and an “actor,” but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that synergizes a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained with a dedicated two-stage pipeline. The first stage, Specialization, employs a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of initial task trajectories. The second stage, Generalization, aggregates all successful trajectories from all specialized experts. This consolidated, high-quality dataset is then used to perform supervised fine-tuning (SFT) on the final planner, equipping it with the robust, cross-domain capabilities of a generalist. Evaluated on four challenging applications from the ScienceBoard benchmark, our framework significantly outperforms the baseline and establishes a new state-of-the-art (SOTA) among open-source models. Our models and code are available at https://github.com/OpenIXCLab/CODA.

† Corresponding authors. ∗ Equal contribution.

1 Introduction

Autonomous agents for Graphical User Interfaces (GUIs) (Anthropic, 2024; OpenAI, 2025; Qin et al., 2025; Lin et al., 2024; Wu et al., 2024b; Hong et al., 2023) promise to automate a wide range of digital tasks (Zhou et al., 2023; Xie et al., 2024). However, their application in specialized domains such as scientific computing and engineering analysis remains highly challenging (Sun et al., 2025a). These environments pose two primary difficulties: first, their interfaces are highly complex, requiring precise and fine-grained actions; second, the problems they address are intrinsically complicated, demanding long-horizon planning to achieve effective solutions.

Effective agency for computer task automation in these domains requires high-level planning, low-level execution, and domain knowledge. However, current models exhibit a clear trade-off. Generalist models like Qwen2.5-VL (Bai et al., 2025) provide robust planning capabilities but often struggle with the precise grounding needed for reliable execution. Conversely, specialized agents (Wu et al., 2024b; 2025; Xie et al., 2025) like UI-TARS (Qin et al., 2025) are highly proficient in execution, yet their capacity for complex, high-level planning is more constrained.

To bridge this gap, a natural approach has been to develop compositional frameworks that explicitly decouple planning from execution, effectively pairing a generalist “cerebrum” with a specialist “cerebellum” (Agashe et al., 2024; 2025). While promising, these pioneering approaches are fundamentally limited. They are typically static and non-trainable, relying on powerful, often closed-source models as their core planner. This design introduces significant drawbacks: it compromises transparency and replicability, and most critically, prevents the agent from learning and adapting through experience.

This architectural decoupling is not merely an engineering convenience but is inspired by the functional architecture of the human brain (illustrated in Fig. 1). The specialization of high-level planning (the Cerebrum) and low-level motor control (the Cerebellum) is a key aspect of human intelligence. Crucially, these structures exhibit different learning patterns: the Cerebellum, once mature, provides stable and broadly applicable motor skills that require infrequent updates (Ito, 2000). In contrast, the Cerebrum continuously adapts its strategies based on the nuances of new tasks and environments (Demarin & Morović, 2014; Hallett, 2005). This biological parallel motivates our core hypothesis: an effective agent should pair a stable, proficient grounding model with a dynamic planner that is specialized for different software domains through targeted, experience-driven learning.

To realize this vision, we propose a trainable compositional framework that integrates Qwen2.5-VL (Bai et al., 2025) as the planner (cerebrum) and UI-TARS-1.5 (Qin et al., 2025) as the executor (cerebellum). Unlike prompting-based systems that rely on proprietary closed-source planners, our framework makes the planner itself learnable through interaction with software environments mediated by a static executor. Concretely, the executor provides stable, software-agnostic grounding for low-level GUI actions, while the planner, by leveraging this reliable interface, can gradually acquire domain-specific knowledge and improve its high-level planning strategies. In contrast to end-to-end training of a full agent, which requires massive amounts of specialized data and costly retraining of both perception and execution modules, our decoupled approach is substantially more data efficient: only the planner is optimized for domain adaptation, while the executor remains fixed as a general-purpose grounder that already possesses strong generalization ability after massive pretraining for grounding purposes. This design reduces reliance on curated trajectories, lowers training cost, and ensures controllable adaptation.

To train the planner effectively under this cerebrum–cerebellum separation, we avoid the need for costly human-labeled trajectories. Instead, we leverage a judging system built from open-source models to automatically provide dense reward signals, combined with autonomous interaction with scientific software environments through the static executor. This setup enables the planner to gradually acquire domain-specific planning ability with zero human effort. Furthermore, by distributing the interaction process across multiple software environments in parallel—coordinated by a central master—we can significantly accelerate reinforcement learning. This strategy not only makes the training process more efficient but also echoes our brain-inspired design: the cerebellum-like executor delivers stable grounding, while the cerebrum-like planner continually adapts through experience.

We validate our framework on four typical scientific software applications from the ScienceBoard benchmark (Sun et al., 2025a). Experiments show that our method not only improves significantly over the baseline (Cerebrum: Qwen2.5-VL-32B, Cerebellum: UI-TARS-1.5) but also establishes a new state-of-the-art (SOTA) among open-source models, confirming its effectiveness.

2 Related Works

Reinforcement Learning for LVLMs. Training for LLMs and LVLMs (Touvron et al., 2023; Grattafiori et al., 2024; Liu et al., 2023a; Bai et al., 2025; Wang et al., 2024; Xing et al., 2025; Sun et al., 2024d; c; Ding et al., 2025) has progressed from data-intensive Supervised Fine-Tuning (SFT) (Liu et al., 2023a; Wei et al., 2022) towards Reinforcement Learning (RL). Algorithms like Group Relative Policy Optimization (GRPO) (Guo et al., 2025; Shao et al., 2024) have proven effective for reasoning tasks, moving beyond earlier single-turn RLHF applications (Ouyang et al., 2022; Ziegler et al., 2019; Rafailov et al., 2023). However, applying RL to complex agentic tasks (Bai et al., 2024; Qi et al., 2024; Zhou et al., 2024; Zhai et al., 2024; Carta et al., 2023) is challenging. Prevailing methods train monolithic agents end-to-end, often requiring co-trained critic models (Schulman et al., 2015) or preference-based optimization like DPO (Rafailov et al., 2023; Putta et al., 2024; Qin et al., 2025), which problematically entangles the distinct skills of planning and execution. In contrast, our work employs a decoupled reinforcement learning strategy: the high-level planner is optimized via environmental interaction while the execution model remains fixed. We adapt GRPO by computing rewards from the final action and backpropagating the advantage exclusively through planning tokens. This targeted optimization stably enhances strategic planning, distinguishing our method from prior works that train dedicated critic models (Bai et al., 2024; Qi et al., 2024) or use filtered behavior cloning (Pan et al., 2024; Chen et al., 2020).

Computer Use Agent. Fueled by advancements in Large Vision-Language Models (LVLMs) (Touvron et al., 2023; Grattafiori et al., 2024; Liu et al., 2023a; Bai et al., 2025; Wang et al., 2024), a new generation of agents capable of operating computers via multi-modal inputs is emerging (Hu et al., 2024b; Hong et al., 2024; Cheng et al., 2024; Nguyen et al., 2024; Lin et al., 2024; Sun et al., 2024b). Whether processing structured text and code (Qi et al., 2024; Putta et al., 2024; Lai et al., 2024; Sun et al., 2024a; Nakano et al., 2021) or screenshots (Hong et al., 2023; Lin et al., 2024; Wu et al., 2024b; OpenAI, 2025), these agents face an inherent dichotomy analogous to human cognition: the tension between high-level strategic planning and precise, low-level action execution. This has motivated the development of compositional frameworks that decouple these responsibilities (Agashe et al., 2024; 2025; Liu et al., 2023b; Zhang et al., 2025; Song et al., 2025). However, a significant portion of this research relies on static, non-trainable systems that orchestrate powerful, often proprietary models (Anthropic, 2024; OpenAI, 2025; Google DeepMind, 2025; Yan et al., 2023; He et al., 2024; Zhang et al., 2024; Wang et al., 2023; Wu et al., 2024a) as their core planner. This design fundamentally prevents the agent from adapting through experience—a critical flaw for mastering novel software where interaction data is scarce. Our work charts a different course by exploring reinforcement fine-tuning of the planner. By enabling the planner to learn specialized domain knowledge through direct software interaction via a fixed execution model, our strategy achieves robust performance on unfamiliar applications.

3 Method

3.1 Problem Formulation

We formally define the task of autonomous GUI operation for software workflows as a Partially Observable Markov Decision Process (POMDP). Each task is initiated with a natural language instruction $g$ from the task space $\mathcal{G}$ . At each timestep $t$ , the agent perceives the latent environment state $s_{t}\in\mathcal{S}$ through a visual observation $o_{t}\in\Omega$ , consisting of a screenshot of the user interface. The agent’s behavior is governed by a policy $\pi$ , instantiated by a large vision-language model, which synthesizes an action program $a_{t}\in\mathcal{A}$ . The action space $\mathcal{A}$ consists of precisely parameterized pyautogui scripts, where precision in arguments (e.g., coordinates) is critical for execution. The policy generates this action based on the initial instruction and the history of interactions:

$$a_{t}=\pi\big(g,(o_{1},a_{1},\dots,a_{t-1},o_{t})\big)$$

This sequential process induces a state trajectory $\tau=(s_{0},s_{1},\dots,s_{T})$ with the maximum time step $T$ . A task is considered successful if the final state $s_{T}$ satisfies the predefined goal condition specified in $\mathcal{G}$ .

3.2 Model Architecture

To address the inherent trade-off in monolithic models, which struggle to balance long-horizon planning with precise action grounding, we propose a composite agent architecture that structures the decision-making process into a Planner-Executor framework. This design decouples the task into two distinct yet collaborative modules: a high-level Planner responsible for strategic thinking and a low-level Executor for concrete action execution.

Planner

The Planner is instantiated from the Qwen2.5-VL (Bai et al., 2025) model. Its primary responsibility is to analyze the task’s progress and formulate a high-level, explicit plan $p_{t}$ for each step. Specifically, at each timestep $t$ , the Planner receives the interaction history up to the previous step $m_{t-1}=(p_{1},a_{1},\dots,p_{t-1},a_{t-1})$ , the current visual observation $o_{t}$ , and the preceding observation $o_{t-1}$ . The output is a structured thought, denoted as $p_{t}$ , which outlines the immediate objective and explicitly identifies the target UI elements for interaction. The process can be summarized as:

$$p_{t}=\text{Planner}(m_{t-1},o_{t-1},o_{t})$$

Executor

The Executor employs a UI-TARS-1.5 (Qin et al., 2025) model. Its role is to translate the Planner’s abstract thought $p_{t}$ into a precise, executable action. The Executor is provided with the same historical and visual context as the Planner ( $m_{t-1}$ , $o_{t-1}$ , and $o_{t}$ ), but is critically augmented with the Planner’s newly generated thought $p_{t}$ . Its output is a low-level GUI action $a_{t}$ in the form of a ‘pyautogui’ command, such as ‘click(x, y)’. The Executor’s operation is defined as:

$$a_{t}=\text{Executor}(m_{t-1},o_{t-1},o_{t},p_{t})$$
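The interplay between the two modules can be sketched as a simple rollout loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `planner`, `executor`, and `step_env` are hypothetical callables standing in for Qwen2.5-VL, UI-TARS-1.5, and the software environment, observations are plain strings rather than screenshots, and `max_steps` is an assumed cap.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases; real observations would be screenshots.
Observation = str
Plan = str
Action = str
History = List[Tuple[Plan, Action]]  # m_{t-1} = (p_1, a_1, ..., p_{t-1}, a_{t-1})


def run_episode(
    planner: Callable[[History, Observation, Observation], Plan],
    executor: Callable[[History, Observation, Observation, Plan], Action],
    step_env: Callable[[Action], Tuple[Observation, bool]],
    first_obs: Observation,
    max_steps: int = 15,
) -> History:
    """Roll out one task: the planner proposes p_t, the executor grounds it as a_t."""
    history: History = []
    prev_obs, obs = first_obs, first_obs  # no earlier screenshot exists at the first step
    for _ in range(max_steps):
        plan = planner(history, prev_obs, obs)            # p_t
        action = executor(history, prev_obs, obs, plan)   # a_t
        history.append((plan, action))
        prev_obs = obs
        obs, done = step_env(action)                      # apply the action, get o_{t+1}
        if done:
            break
    return history
```

Because the executor only consumes the plan and the shared context, it can stay frozen while the planner is swapped or retrained.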

3.3 Training Pipeline

Our training methodology employs a two-stage curriculum designed for initial specialization followed by broad generalization.

3.3.1 Stage 1: Specialization via Decoupled Reinforcement Learning

The primary objective of this initial training stage is to enhance the agent’s specialized performance on individual software applications.

Through empirical analysis, we observed that the Executor exhibits strong generalization capabilities, accurately translating well-structured plans into executable actions. However, the Planner module emerged as the primary bottleneck, often struggling to formulate effective high-level strategies. To address this, we adopt a decoupled training strategy that focuses reinforcement learning exclusively on the Planner ( $\pi_{\theta}=\text{Planner}$ ). This targeted approach allows us to refine the agent’s strategic reasoning without altering the already competent Executor.

Since the initial Planner is relatively weak and generates a limited number of successful trajectories, standard reinforcement learning methods can be inefficient. Therefore, we adapt the Group Relative Policy Optimization (GRPO) framework (Guo et al., 2025; Shao et al., 2024), which is particularly effective in such scenarios. GRPO can derive a meaningful learning signal by comparing the relative quality of different outputs, even when most of them are suboptimal.

The training process for a given task unfolds as follows. Given the current state and interaction history, the Planner first generates a group of $G$ candidate plans. Subsequently, the fixed Executor takes each plan as input and produces a corresponding low-level action. To generate a fine-grained learning signal, we compute a reward for each plan by comparing its resulting action $a^{(i)}$ to the labeled positive action $a_{T}$ (details of the labeling process are in Sec. 3.4). Our composite reward function assesses both the correctness of the action type and the precision of its parameters:

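$$r^{(i)}=r(a^{(i)},a_{T})=\mathbb{I}\big(\text{type}(a^{(i)})=\text{type}(a_{T})\big)+r_{\text{dist}}(a^{(i)},a_{T})$$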

Here, the indicator function $\mathbb{I}(\cdot)$ provides a binary reward for selecting the correct type of action (e.g., click vs. type). The term $r_{\text{dist}}(a^{(i)},a_{T})$ offers a continuous reward based on the parametric similarity between the predicted and ground-truth actions, such as L1 distance for coordinates or IoU for bounding boxes. These distance-based rewards are normalized to $[0,1]$ to ensure consistent scaling.
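A small sketch may make the composite reward concrete for click-style actions. The text only states that the distance term is normalized to $[0,1]$; the specific normalization used here (one minus the L1 distance divided by screen width plus height), the `GuiAction` container, the default screen size, and the choice to give zero distance reward to actions without coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GuiAction:
    kind: str                                    # e.g. "click" or "type"
    point: Optional[Tuple[float, float]] = None  # (x, y) in pixels, if the action has one


def dist_reward(pred: GuiAction, target: GuiAction,
                screen: Tuple[int, int] = (1920, 1080)) -> float:
    """Assumed normalization: 1 - L1 / (width + height), clipped to [0, 1]."""
    if pred.point is None or target.point is None:
        return 0.0  # assumption: no coordinates means no distance reward
    l1 = abs(pred.point[0] - target.point[0]) + abs(pred.point[1] - target.point[1])
    return max(0.0, 1.0 - l1 / (screen[0] + screen[1]))


def action_reward(pred: GuiAction, target: GuiAction) -> float:
    """Indicator on the action type plus the normalized distance term."""
    type_term = 1.0 if pred.kind == target.kind else 0.0
    return type_term + dist_reward(pred, target)
```

Under these assumptions a plan whose action matches the labeled one exactly scores 2.0, while a plan that leads to the wrong action type and carries no coordinates scores 0.0.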

Once the rewards are calculated, they are used to derive a relative advantage $A^{(i)}$ for each plan, which is then fed into the GRPO loss function to update the Planner policy:

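$$A^{(i)}=\frac{r^{(i)}-\text{mean}(\{r^{(j)}\}_{j=1}^{G})}{\text{std}(\{r^{(j)}\}_{j=1}^{G})},\quad i=1,\dots,G.$$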
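The group-relative advantage is a per-group standardization of the rewards, which can be sketched in a few lines. Two details are assumptions not stated in the text: the use of the population standard deviation, and the small `eps` added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize the rewards of the G plans sampled for the same state."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]
```

Plans that beat the group average receive a positive advantage even when every plan in the group is imperfect, which is what lets a weak initial Planner still produce a learning signal.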

The GRPO loss is formulated as follows:

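$$\mathcal{L}_{\text{GRPO}}(\pi_{\theta})=\dots\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|p^{(i)}|}\dots$$

where $r^{i,t}(\theta)=\frac{\pi_{\theta}(p^{(i)}\mid s,I)}{\pi_{\theta_{\text{ref}}}(p^{(i)}\mid s,I)}$ and $D_{\text{KL}}^{i,t}(\pi_{\theta},\pi_{\text{ref}})=\frac{\pi_{\text{ref}}(p^{(i)}\mid s,I)}{\pi_{\theta}(p^{(i)}\mid s,I)}-1-\log\frac{\pi_{\text{ref}}(p^{(i)}\mid s,I)}{\pi_{\theta}(p^{(i)}\mid s,I)}$.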

Consistent with the approach in (Shao et al., 2024; Guo et al., 2025), this advantage is applied across all reasoning tokens in the plan $p^{(i)}$ , encouraging the model to develop more robust and free-form planning capabilities.
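How a single plan-level advantage is spread over the planning tokens can be illustrated with the standard clipped GRPO objective of Shao et al. (2024). The sketch below is an illustration under assumptions, not the paper's training code: the clipping range `clip_eps`, the KL weight `beta`, and their default values are hypothetical, and the importance ratio is taken against the policy that sampled the plan, which is the usual GRPO convention.

```python
import math
from typing import List


def grpo_plan_loss(logp_new: List[float], logp_old: List[float],
                   logp_ref: List[float], advantage: float,
                   clip_eps: float = 0.2, beta: float = 0.04) -> float:
    """Loss for one sampled plan; each list holds per-token log-probabilities."""
    total = 0.0
    for ln, lo, lr in zip(logp_new, logp_old, logp_ref):
        ratio = math.exp(ln - lo)                               # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * advantage, clipped * advantage)  # same A for every token
        ref_over_new = math.exp(lr - ln)
        kl = ref_over_new - 1.0 - (lr - ln)                      # ref/new - 1 - log(ref/new) >= 0
        total += surrogate - beta * kl
    return -total / len(logp_new)  # 1/|p| token average, negated for minimization
```

Averaging this quantity over the $G$ plans of a group gives the batch loss; only the Planner's tokens appear in it, so no gradient reaches the Executor.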

3.3.2 Stage 2: Generalization via Aggregated Supervised Fine-Tuning

We adopt the specialist-to-generalist paradigm proposed in Sun et al. (2025b), where a generalist model is trained by leveraging multiple specialist models as teachers. We observe that directly applying reinforcement learning across all software leads to suboptimal performance. To address this, we first train four specialist models using the methods described in Sec. 3.3.1. These specialists are then employed to generate new trajectories for each software, which serve as supervision for training a generalist model. After learning from the four software-specific teachers, the resulting generalist not only surpasses its teachers in performance, but also demonstrates stronger reasoning and reflection abilities during planning, as well as broader domain knowledge across different software.
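The aggregation step can be pictured as pooling judged-successful trajectories from every specialist into one supervised dataset. This is a hedged sketch: the `Trajectory` layout, the field names, and the per-step sample format are hypothetical, chosen only to show the filtering and pooling logic.

```python
from typing import Dict, List, TypedDict


class Trajectory(TypedDict):
    software: str
    steps: List[dict]      # one {"plan": ..., "action": ...} record per step
    judged_success: bool   # verdict from the judge system


def build_sft_dataset(per_specialist: Dict[str, List[Trajectory]]) -> List[dict]:
    """Pool successful trajectories from all specialists into per-step SFT samples."""
    samples = []
    for software, trajectories in per_specialist.items():
        for traj in trajectories:
            if not traj["judged_success"]:
                continue  # failed rollouts are not used as supervision
            for i, step in enumerate(traj["steps"]):
                samples.append({
                    "software": software,
                    "history": traj["steps"][:i],   # earlier plan/action pairs
                    "target_plan": step["plan"],    # what the generalist should output
                })
    return samples
```

Mixing samples from all software in one dataset is what exposes the generalist planner to every domain at once.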

3.4 Auto Exploration Pipeline

Auto Task Generation.

We employ Qwen2.5-72B (Wang et al., 2024) as the task generator to produce high-level tasks. Specifically, a small set of real human-instructed tasks on each software is provided as input, together with the prompt shown in Fig. 5. The agent then repeatedly executes these tasks to collect a diverse set of interaction trajectories, which are subsequently filtered by a judge system to retain only trajectories with positive actions for training.

Judge System for Providing Reward Signals.

Our judge system labels the positive actions $a_{T}$ within an agent’s trajectory when performing a task. Given a full trajectory $\mathcal{H}=\{o_{0},a_{0},\dots,o_{\text{final}}\}$ , the judge takes the complete sequence of screenshot observations $(o_{1},o_{2},\dots,o_{n})$ as input and outputs three signals: Correctness, Redundant, and First Error Step, using the detailed prompt shown in Fig. 6. A trajectory is considered clean and successful when Correctness is True and both Redundant and First Error Step are empty. In this case, all actions $a$ in the trajectory are labeled as $a_{T}$ . We present a detailed evaluation of the judge’s precision and discuss approaches for improving it in Sec. 4.2.
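The labeling rule for clean trajectories can be written down directly. In this sketch the `JudgeVerdict` container and the representation of Redundant as a list of step indices are assumptions, and returning no positive actions for a trajectory that is not clean is a simplification, since the text only specifies the clean case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JudgeVerdict:
    correctness: bool
    redundant: List[int] = field(default_factory=list)  # indices of redundant steps (assumed format)
    first_error_step: Optional[int] = None


def positive_actions(actions: List[str], verdict: JudgeVerdict) -> List[str]:
    """Label every action as positive only when the trajectory is clean and successful."""
    clean = (verdict.correctness
             and not verdict.redundant
             and verdict.first_error_step is None)
    return list(actions) if clean else []
```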

Distributed Virtual Machine System. Task execution is the most time-consuming step in our pipeline, so we developed a lightweight distributed system to accelerate large-scale trajectory curation. As illustrated in Fig. 3(b), the system follows an HTTP-based master–client architecture: the master node manages a dynamic task queue, monitors execution progress, and aggregates results, while multiple client nodes execute tasks in parallel within isolated virtual machine environments. This design enables efficient scaling to hundreds of concurrent environments, substantially reducing the time required to collect successful trajectories and making the framework well-suited for large-scale training and evaluation.

4 Experiments

4.1 Agent Performance Evaluation

Our planner-executor approach uses Qwen2.5-VL-32B (Bai et al., 2025) as the planner and UI-TARS-1.5-7B (Qin et al., 2025) as the executor. We use the method proposed in Sec. 3.4 to generate high-level tasks for each software from ScienceBoard (Sun et al., 2025a), and train the planner on them through the decoupled reinforcement learning proposed in Sec. 3.3.1. During training, the reward signal is provided by our judge system, which is evaluated in Sec. 4.2. Our training is based on OpenRLHF (Hu et al., 2024a). As reported in Tab. 1, we evaluate on four GUI-centric software applications from ScienceBoard (Sun et al., 2025a). We also report other planner-executor decoupled approaches. This first-stage reinforcement learning leads to a significant performance gain over the baseline.

In the second stage, we use these specialist planners as teachers for a generalist planner. This new model is also initialized from Qwen2.5-VL-32B and undergoes supervised fine-tuning on 0.77K trajectories from the teacher models, labeled by our judge system. As shown in Tab. 1, this new model surpasses the ensemble of individual specialists, showing improved reasoning and planning abilities. This result demonstrates the effectiveness of our specialist-to-generalist strategy.
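The metric names in Tab. 1 are not defined in the text. Assuming the usual reading, Average@k is the mean success rate over k independent runs per task, and Pass@k counts a task as solved if at least one of its k runs succeeds. The sketch below encodes that assumed reading; it is not taken from the paper's evaluation code.

```python
from typing import List


def average_at_k(results: List[List[bool]]) -> float:
    """Mean success rate over all tasks, averaging the k runs of each task first."""
    return sum(sum(runs) / len(runs) for runs in results) / len(results)


def pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of tasks solved in at least one of their k runs."""
    return sum(any(runs) for runs in results) / len(results)
```

Under this reading Pass@k can never be lower than Average@k for the same runs, which matches the ordering of the Average@8 and Pass@8 rows.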

4.2 Towards Precise Judging System

Our reinforcement learning framework heavily relies on accurate judgments of agent trajectories to provide reliable reward signals. In this section, we present a detailed evaluation of our judge model, which demonstrates improved precision in decision making.

Settings. We conduct experiments on two sources of trajectories. (1) AgentRewardBench (Lù et al., 2025), a benchmark designed specifically for judge evaluation. (2) A trajectory dataset we collected from ScienceBoard (Sun et al., 2025a). We run Qwen2.5-VL-72B (Bai et al., 2025) on ScienceBoard tasks and extract 377 labeled trajectories, which are then used as inputs to our judge model. This setup allows us to quantitatively assess the judge’s ability to discriminate between successful and failed executions. We report Precision and Recall as our primary metrics. For voting-based strategies, we adopt a sampling temperature of $T=1.0$ and a nucleus sampling probability of $top\_p=0.6$ over 4 independent inference runs.
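For reference, precision and recall of a judge follow from comparing its verdicts with the ground-truth outcomes. This is a minimal sketch with hypothetical argument names; returning 0.0 when a denominator is empty is a convention chosen here.

```python
from typing import Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """predicted[i]: judge says trajectory i succeeded; actual[i]: it really did."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision matters most for reward labeling, because a false positive marks the actions of a failed trajectory as correct.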

Results. As summarized in Table 2, our evaluations reveal three effective strategies for improving precision, building upon difference description fine-tuning (Sun et al., 2025b):

Voting. Instead of a single query, we prompt the model multiple times with high randomness ( $T=1.0$ , $top\_p=0.6$ ). A trajectory is only deemed successful if all votes agree, which significantly reduces false positives.
Multi-resolution inputs. Trajectories often include long sequences of high-resolution screenshots. We observe that using a mixture of resolutions across voting rounds is beneficial: low-resolution images help capture global execution dynamics, while high-resolution images aid in detecting fine-grained correctness. In practice, we first apply low-resolution inputs to quickly filter out failures, thereby improving both precision and efficiency.
Model ensembling. In addition to the fine-tuned judge model (see Appendix A), we find that ensembling two models within the voting strategy further enhances precision.
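The three strategies compose naturally into one unanimous-vote routine. The sketch below is illustrative only: the `Judge` callable, the resolution labels, the split of one low-resolution round followed by three high-resolution rounds, and the round-robin alternation between models are assumptions; the text specifies only four votes, unanimity, a low-resolution first pass, and a two-model ensemble.

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], bool]  # (trajectory_id, resolution) -> "looks successful?"


def unanimous_vote(trajectory_id: str, judges: Sequence[Judge],
                   resolutions: Sequence[str] = ("low", "high", "high", "high")) -> bool:
    """Accept a trajectory only if every vote says success; stop at the first rejection."""
    for round_idx, resolution in enumerate(resolutions):
        judge = judges[round_idx % len(judges)]  # alternate models when ensembling
        if not judge(trajectory_id, resolution):
            return False  # a cheap low-resolution round can reject early
    return True
```

Requiring unanimity trades recall for precision: a single dissenting vote is enough to discard a trajectory.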

Across both ScienceBoard (Sun et al., 2025a) and AgentRewardBench (Lù et al., 2025), we observe a consistent progression: the fine-tuned model (72B-GUI-Judge) improves precision over the single-query base model (64.5 to 73.5 on AgentRewardBench, 41.5 to 43.7 on ScienceBoard) without gaining recall; voting increases precision further; multi-resolution inputs add further gains; and ensembling reaches the highest precision (81.2 and 69.5) while maintaining competitive recall. This consistent trend across benchmarks highlights the robustness and generality of our proposed strategies. With these methods, the judge system provides high-quality reward signals that let the planner improve its reasoning ability and learn software domain knowledge through RL.

5 Conclusion

We presented a trainable Planner–Executor disentangled framework for GUI agents, inspired by the division of labor between the cerebrum and cerebellum. By coupling a fixed executor (UI-TARS-1.5) with a fine-tunable planner (Qwen2.5-VL), and supporting it with a robust judging system, GRPO-based exploration, and a distributed data generation pipeline, our approach effectively addresses the challenges of complex interfaces and long-horizon planning. Experiments on ScienceBoard applications demonstrate substantial improvements over strong baselines, establishing a new open-source state-of-the-art. These results highlight the importance of combining stable execution with adaptive planning, and open promising directions for extending our framework to richer multi-modal feedback, broader professional domains, and continual learning for long-term adaptability.

Appendix A Judge Model Fine-tuning Details

Inspired by Sun et al. (2025b), we fine-tune a model to obtain a strong judge. We scale the base model up to Qwen2.5-VL-72B (Bai et al., 2025) and train on a dataset of 4.7K labeled judgment samples. The underlying trajectories are generated by Qwen2.5-VL and Gemini-2.0-Pro on WebArena (Zhou et al., 2023), and by UI-TARS-1.5 (Qin et al., 2025) and GPT-4o (OpenAI, 2023) on OSWorld (Qin et al., 2025). Judgments are provided by GPT-4o and Gemini-2.5-Pro (Google DeepMind, 2025), together with detailed captions for each screenshot frame of the agent’s execution. We then filter the judgments, retaining only those that agree with verified ground-truth results. Additionally, we incorporate change-description data, following SEAgent (Sun et al., 2025b).

Training is conducted on 32 A100 GPUs for 370 steps, using LoRA (Hu et al., 2022) with a rank of 8. The resulting model, trained on OSWorld trajectories, generalizes well to AgentRewardBench (Lù et al., 2025) and ScienceBoard (Sun et al., 2025a). This fine-tuned model is referred to as 72B-GUI-Judge in Table 2 and demonstrates improved precision on these two out-of-domain benchmarks. When further ensembled with the original 72B base model, it achieves even higher precision, providing the more accurate reward signals that are crucial for effective reinforcement learning of the planner agent.
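As a reminder of what a rank of 8 implies, the standard LoRA reparameterization from Hu et al. (2022) is sketched below. The scaling factor $\alpha$ and the matrix shape $d = k = 8192$ are illustrative assumptions; this appendix reports only the rank, not $\alpha$ or which weight matrices are adapted.

```latex
% Standard LoRA update; W_0 is frozen and only A, B are trained.
W = W_0 + \Delta W, \qquad
\Delta W = \frac{\alpha}{r}\, B A, \qquad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r = 8.

% Trainable parameters per adapted matrix, compared with full fine-tuning
% (d = k = 8192 is an assumed, illustrative shape):
\underbrace{r\,(d + k)}_{\text{LoRA}} = 8 \cdot 16384 = 131{,}072
\qquad \text{vs.} \qquad
\underbrace{d\,k}_{\text{full}} = 8192^2 = 67{,}108{,}864.
```

Under these assumed shapes, each adapted matrix trains roughly 0.2% of the parameters that full fine-tuning would update.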

Appendix B Prompt Details

We provide the detailed prompt for the task generator in Fig. 5 and for the judge system in Fig. 6. The detailed prompt for the planner agent is shown in Fig. 7. The prompt used for the executor agent follows the official UI-TARS code (Qin et al., 2025).

Appendix C Virtual Machine System Details

We utilized a local cluster consisting of 15 servers to collect interaction trajectories. Among these, 13 servers were equipped with AMD EPYC 7742 processors, and 2 servers were equipped with Intel i9-13900K CPUs paired with NVIDIA GeForce RTX 4090 GPUs to support software with high graphical computing demands, such as ChimeraX. Using VMware Workstation Pro, we ran 4 to 8 independent virtual machines concurrently on each server to execute tasks in parallel.
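The figures above imply between 60 and 120 virtual machines running concurrently across the cluster. The sketch below works through that arithmetic and shows one simple way tasks could be spread over VM slots. The round-robin policy, the function names, and the task list are hypothetical, since the scheduling scheme is not described here.

```python
from itertools import cycle
from typing import Dict, List, Tuple

SERVERS = 15                   # from the text
VMS_PER_SERVER_RANGE = (4, 8)  # from the text: 4 to 8 VMs per server


def concurrency_bounds(servers: int, per_server_range: Tuple[int, int]) -> Tuple[int, int]:
    """Minimum and maximum number of VMs running at once across the cluster."""
    low, high = per_server_range
    return servers * low, servers * high


def assign_round_robin(
    tasks: List[str], servers: int, vms_per_server: int
) -> Dict[Tuple[int, int], List[str]]:
    """Hypothetical policy: deal tasks out to (server, vm) slots in turn."""
    slots = [(s, v) for s in range(servers) for v in range(vms_per_server)]
    assignment: Dict[Tuple[int, int], List[str]] = {slot: [] for slot in slots}
    for task, slot in zip(tasks, cycle(slots)):
        assignment[slot].append(task)
    return assignment


bounds = concurrency_bounds(SERVERS, VMS_PER_SERVER_RANGE)
tasks = [f"task-{i}" for i in range(130)]  # made-up task identifiers
assignment = assign_round_robin(tasks, SERVERS, vms_per_server=4)
```

With 4 VMs per server, the 130 illustrative tasks are dealt across 60 slots, so each slot receives either 2 or 3 tasks.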

Bibliography

1. Agashe et al. (2024) Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164, 2024.
2. Agashe et al. (2025) Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025.
3. Anthropic (2024) Anthropic. Claude computer use. 2024. URL https://www.anthropic.com/news/3-5-models-and-computer-use
4. Anthropic (2025) Anthropic. Claude’s extended thinking. 2025. URL https://www.anthropic.com/research/visible-extended-thinking
5. Bai et al. (2024) Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024.
6. Bai et al. (2025) Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
7. Carta et al. (2023) Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. In International Conference on Machine Learning, pp. 3676–3713. PMLR, 2023.
8. Chen et al. (2020) Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, and Keith Ross. BAIL: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems, 33:18353–18363, 2020.