GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

Shaokang Wang; Pei Fu; Ruoceng Zhang; Shaojie Zhang; Xiuwen Xi; Jiahui Yang; Bin Qin; Ying Huang; Zhenbo Luo; Jian Luan

arXiv:2601.18197·cs.AI·January 27, 2026

Open Access · 3 Reviews

TL;DR

GAIA introduces a training framework that iteratively improves GUI agent performance by using a critic model to evaluate and refine actions, enabling self-improvement and better handling of errors during task execution.

Contribution

The paper presents GAIA, a novel data flywheel system that trains an intuitive critic model to enhance GUI agent performance through iterative self-improvement cycles.

Findings

1. The Intuitive Critic Model (ICM) improves the test-time performance of GUI agents.
2. Performance increases as data is recycled through the system.
3. The approach is effective on both open-source and closed-source models.

Abstract

While Large Vision-Language Models (LVLMs) have significantly advanced GUI agents' capabilities in parsing textual instructions, interpreting screen content, and executing tasks, a critical challenge persists: the irreversibility of agent operations, where a single erroneous action can trigger catastrophic deviations. To address this, we propose the GUI Action Critic's Data Flywheel System (GAIA), a training framework that equips models with iterative critic capabilities, which are used to improve the performance of basic GUI agents through Test-Time Scaling (TTS). Specifically, we first train an Intuitive Critic Model (ICM) using positive and negative action examples from a base agent. This critic evaluates the immediate correctness of the agent's intended actions, thereby selecting operations with higher success probability. Then, the initial critic guides agent actions to collect…
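The critic-guided selection step described in the abstract can be pictured as best-of-N sampling: the agent proposes several candidate actions and the critic keeps the one it rates most likely to be correct. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `best_of_n`, `toy_critic`, and `TOY_SCORES`, the action strings, and the scores are all hypothetical; a real critic would be a trained model that scores an action given the screenshot and instruction.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], critic_score: Callable[[str], float]) -> str:
    """Return the candidate action the critic scores highest.

    Ties go to the earliest candidate, because max() keeps the first maximum.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=critic_score)


# Hypothetical stand-in for a trained critic: a fixed lookup of
# "probability the action is correct" for three sampled actions.
TOY_SCORES = {
    "click(settings_icon)": 0.91,
    "click(delete_account)": 0.07,
    "scroll(down)": 0.40,
}


def toy_critic(action: str) -> float:
    # Unknown actions get the lowest score in this toy setup.
    return TOY_SCORES.get(action, 0.0)


chosen = best_of_n(list(TOY_SCORES), toy_critic)
print(chosen)  # prints "click(settings_icon)"
```

Scoring happens before any action is executed, which is the point of the design: an irreversible step such as the low-scored `click(delete_account)` is filtered out rather than undone afterwards.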

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The motivation is good. The irreversibility of agent operations is a primary obstacle to deploying GUI agents in real-world, high-stakes environments. A single error can be catastrophic, and a pre-execution validation mechanism is a practical and necessary solution.
2. The data flywheel with the help of pre-trained GUI models is crucial. By using the actual errors generated by a base agent, the method can curate a dataset of positive and negative samples that are closely aligned with the actu

Weaknesses

1. One main claim of the paper is the data flywheel effect. However, according to the experiments, second-round training of the critic model does not lead to a steady performance gain. On many dimensions, ICM performs even better than ICM-r2. The flywheel appears to stall after a single turn.
2. The experimental setup simplifies the operation of the base agents by discarding excessive historical image input and feeding only the text description of the historical steps. This simplification

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Originality: Proposes a practical “data flywheel” for critic training using real agent actions rather than heuristic negatives, better matching the true error distribution (Section 3.2.1; Figure 2). The “intuitive” binary critic is a focused design choice that reduces token overhead versus reasoning critics and fits the best-of-N TTS paradigm (Sections 1, 3.2.2).
- Quality: Solid empirical study across multiple agents and benchmarks, including closed-source models via API (Section 4.1). Result

Weaknesses

- Novelty relative to contemporaries: While the data flywheel with real negative samples is valuable, the high-level idea (critic-guided best-of-N test-time scaling) has appeared in GUI/agent works (e.g., GUI-Genie, GUI-Actor, GTA1, GUI-Critic-R1). The paper’s comparison focuses mainly on UI-Genie-RM (Table 4) and an in-house reasoning critic; broader, controlled comparisons to recent critics/TTT methods are limited.
- Labeling assumptions and potential noise: Positives/negatives are defined by ex

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper addresses a clear challenge in GUI agents, namely the irreversibility of erroneous actions, and attempts to mitigate catastrophic deviations through a critic-guided mechanism.
2. The proposed GAIA framework is clearly structured, with the notion of a data flywheel and iterative critic model (ICM, ICM-r2) presented in a systematic way.
3. The integration with Test-Time Scaling (TTS) and the Best-of-N strategy is straightforward and easy to follow, showing how the critic can filter

Weaknesses

1. The overall contribution of GAIA appears somewhat incremental, since the Intuitive Critic Model (ICM) and the data flywheel mainly combine existing elements such as Test-Time Scaling and iterative filtering.
2. The framework is more system- and process-oriented, presenting an engineering pipeline rather than offering a fundamentally novel algorithmic or theoretical contribution. This limits the originality of the work in an academic sense.
3. The description of the “data flywheel” remains r

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)