GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Chen Chen; Jiawei Shao; Dakuan Lu; Haoyi Hu; Xiangcheng Liu; Hantao Yao; Wu Liu

arXiv:2601.09770 · cs.AI · January 16, 2026

TL;DR

GUI-Eyes introduces an active perception framework for GUI agents that strategically uses visual tools and staged reasoning to improve accuracy and data efficiency in GUI understanding tasks.

Contribution

The paper proposes a novel RL-based framework with a two-stage perception strategy and a dense reward function for active visual perception in GUI tasks.

Findings

01. Achieves 44.8% grounding accuracy on ScreenSpot-Pro with only 3k labeled samples.

02. Outperforms supervised and RL baselines significantly.

03. Demonstrates the effectiveness of tool-aware active perception for GUI understanding.

Abstract

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which…
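
To make the ideas in the abstract concrete, here is a minimal Python sketch of two pieces it mentions: grounding inside a cropped region and then mapping the result back to the full screenshot, and a reward that varies smoothly with distance instead of being a binary hit or miss. This is an illustration under stated assumptions, not the paper's implementation. The names Box, crop_point_to_screen, and continuous_reward, the Gaussian decay, and the sigma parameter are all hypothetical; the abstract as shown here does not specify the actual reward formula or tool interface.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def crop_point_to_screen(crop: Box, u: float, v: float) -> Tuple[float, float]:
    """Map a point in normalized crop coordinates (u, v in [0, 1])
    back to pixel coordinates of the full screenshot."""
    return crop.left + u * crop.width, crop.top + v * crop.height


def continuous_reward(x: float, y: float, target: Box,
                      screen_w: float, screen_h: float,
                      sigma: float = 0.1) -> float:
    """Assumed dense reward: 1.0 inside the target box, then a Gaussian
    decay in the screen-normalized distance to the nearest box edge."""
    # Per-axis distance to the box (0 when the point is inside on that axis).
    dx = max(target.left - x, 0.0, x - target.right) / screen_w
    dy = max(target.top - y, 0.0, y - target.bottom) / screen_h
    dist = math.hypot(dx, dy)
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


# Coarse stage picks a crop; fine stage predicts a point inside that crop.
crop = Box(1000, 500, 1400, 800)
target = Box(1180, 640, 1220, 660)
x, y = crop_point_to_screen(crop, 0.5, 0.5)
reward_hit = continuous_reward(x, y, target, 2560, 1440)
reward_near = continuous_reward(1230, 650, target, 2560, 1440)
reward_far = continuous_reward(0, 0, target, 2560, 1440)
```

In this sketch a prediction that lands just outside the target still earns a reward close to 1, while a distant one earns almost nothing, which gives a learning signal even when the agent misses. Whether GUI-Eyes shapes its reward this way is not stated in the text above.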

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics