TL;DR
This paper introduces XBOUND, a new state-level evaluation framework for device-control agents managing GUIs, revealing insights into model performance, limitations, and the impact of grounding data.
Contribution
The paper proposes XBOUND, a novel evaluation method that assesses device-control agents at the state level, providing a more comprehensive performance analysis than instruction-level methods.
Findings
UI-TARS is the strongest 7B model.
Current agents show bimodal performance in instruction unification.
Sub-7B models have limited state mastery.
Abstract
Recent advancements in vision-language models have increased interest in Device-Control Agents (DC agents) for managing graphical user interfaces (GUIs). With the growing complexity and integration of such agents into various applications, effective evaluation methods have become crucial. The current evaluation method for DC agents primarily focuses on the instruction level, providing the current state (e.g., screenshots) and past execution history to determine actions for target instructions, helping identify potential execution failures. However, in GUI environments, a single state may contain multiple interactive widgets, each linked to different instructions, presenting an opportunity for diverse actions based on various instruction targets. Evaluating the agent's performance solely at the instruction level may overlook the broader context of these interactions. To capture a more…
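The difference between instruction-level and state-level scoring can be made concrete with a minimal Python sketch. It assumes, following one reviewer's description of the method as "essentially averaging per-state accuracy," that a state-level score is the mean of per-state accuracies. The record layout, function names, and sample data are hypothetical, and this is not the paper's definition of the Exploration Metric.

```python
from collections import defaultdict

# Hypothetical evaluation records: (state_id, instruction, agent_was_correct).
# State "s1" is a screen with several widgets, each tied to its own instruction.
records = [
    ("s1", "open search", True),
    ("s1", "open camera", False),
    ("s1", "open settings", False),
    ("s1", "open profile", False),
    ("s2", "go back", True),
]

def instruction_level_accuracy(records):
    """Fraction of (state, instruction) pairs answered correctly, pooled."""
    return sum(ok for _, _, ok in records) / len(records)

def state_level_accuracy(records):
    """Mean of per-state accuracies (assumed form, not the paper's metric)."""
    outcomes_by_state = defaultdict(list)
    for state_id, _, ok in records:
        outcomes_by_state[state_id].append(ok)
    per_state = {s: sum(v) / len(v) for s, v in outcomes_by_state.items()}
    mean_over_states = sum(per_state.values()) / len(per_state)
    return mean_over_states, per_state

pooled = instruction_level_accuracy(records)
state_mean, per_state = state_level_accuracy(records)
print(pooled, state_mean, per_state)
```

Under these assumptions the per-state breakdown exposes that the agent handles only one of the four instructions available in state `s1`, which a single pooled number hides.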
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear reframing of evaluation. Shifting to state-level thinking is useful. It better reflects how real agentic situations present multiple, competing targets and how one state can support many valid instructions. This framing makes failure modes easier to spot (e.g., “keeps clicking the search box even when the camera icon is the goal”). 2. Action Matching vs. Instruction Unification is a simple and useful distinction, even if not worded in simple terms. It distinguishes perception errors (ca
1. The paper does not offer training recipes for reliable direct control (e.g., ways to recover from off-policy drift, which seems to be an important challenge here). 2. While many systems are compared, the discussion rarely digs into why a given architecture fails a given state beyond labeling it “planning” vs. “grounding.” There’s little ablation on perception modules, planners, memory, or search, so it is unclear where the bottlenecks are. For example, with perfect perception of icon knowledge, would the a
The paper tackles an important and timely problem in DCA evaluation. The shift in focus from binary task success/failure to a nuanced, state-centric analysis is a commendable and necessary direction for the community. The conceptual division of agent capabilities into Multi-Widget Action Matching (MWAM) and Uni-Widget Instruction Unification (UWIU) is well-founded, as it captures two distinct and essential skills for robust GUI interaction. The experimental work is extensive, involving a systema
The paper does not adequately justify the advantage of its proposed Exploration Metric (EM) over established instruction-level or task-level success rates. It claims that existing methods overlook broader contextual interactions but fails to provide a clear, quantitative, or qualitative comparison showing how XBOUND offers a superior or more informative assessment. The explanation of the metric itself, particularly the supporting figure, is confusing and difficult to understand, hindering the re
1. Highlights a useful state-level evaluation perspective beyond instruction-level metrics. 2. Defines two clear capability types (MWAM/UWIU) with interpretable results. 3. Provides broad empirical comparison across 11 agents and multiple tasks.
1. Method is simple — essentially averaging per-state accuracy, not a fundamentally new metric. 2. Data quality concerns: large portions of the dataset are LLM-generated without clear validation. 3. Limited scope: experiments cover only Android mobile GUIs. 4. No link to real performance: unclear if state-level scores predict full-task success. 5. Claims overstated: conclusions about bottlenecks or superiority lack ablation or statistical support.
1. This paper identifies an important limitation of current evaluation methods that operate only at the instruction level and motivates the need for state-level analysis in device-control agents. 2. It introduces two concrete evaluation scenarios, Multi-Widget Action Matching and Uni-Widget Instruction Unification, which capture distinct challenges faced by agents in realistic GUI environments. 3. The Exploration Metric provides an intuitive and interpretable way to quantify the extent of stat
1. The novelty of XBOUND lies mainly in shifting the evaluation perspective to the state level. This is useful, but the conceptual advance over existing evaluation methods could be articulated more clearly. 2. The experiments section covers 11 DC agents, yet the choice of only two scenarios may not fully capture the variety of challenges DC agents face in complex GUI environments. 3. GPT-based planning is highlighted as a bottleneck, but the discussion of why it arises and how general this lim
1. The paper introduces a new evaluation paradigm that shifts the focus from instruction-level accuracy to state-level understanding, providing a more holistic assessment of GUI agents’ real capabilities. 2. The proposed XBOUND framework, with its two complementary tasks (MWAM and UWIU) and the Exploration Metric, offers a systematic and interpretable way to analyze model performance across different capability dimensions. 3. The authors conduct extensive experiments on a diverse set of device-c
1. The XBOUND framework shows limited scalability and representativeness when applied to complex applications such as shopping or map apps. The evaluation only identifies visible and clickable UI elements but does not specify how coverage or selection is handled when a screen contains many interactive components, leaving representativeness unclear. Moreover, the framework lacks mechanisms for handling multi-step interaction flows and focuses solely on static states, which may lead to underestima
