TL;DR
This paper introduces PAGER, a geometry-aware agent that improves point-precise GUI control by bridging the semantic-execution gap, so that actions land on exact points in continuous canvas space rather than anywhere inside a tolerant region.
Contribution
The paper presents PAGER, a topology-aware agent, together with a new benchmark, PAGE Bench, and demonstrates substantial performance improvements over existing models on point-precise GUI tasks.
Findings
PAGER achieves a step success rate of over 62%, a 4.1x improvement over baselines.
PAGE Bench contains 4,906 problems with 224K pixel-level actions.
General models exceed 88% action accuracy but achieve under 6% task success.
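The gap between high action accuracy and low task success is what one would expect when per-step errors compound over long trajectories. The sketch below is an illustration rather than the paper's analysis: it assumes every step must succeed, that steps are independent, and that each step succeeds with the same probability. The average trajectory length is a rough estimate derived from the benchmark figures above (about 224K actions over 4,906 problems), and `task_success_rate` is a hypothetical helper name.

```python
def task_success_rate(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps
    with a uniform per-step success probability (a simplifying assumption)."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# Rough average number of actions per problem, from the reported totals.
avg_steps = round(224_000 / 4_906)

# Illustrative estimate only: 88% is the reported action-accuracy floor.
estimate = task_success_rate(0.88, avg_steps)
print(f"{avg_steps} steps at 88% per-step accuracy -> {estimate:.2%} task success")
```

Under these assumptions the estimated task success falls well below 6%, which is consistent in direction with the reported finding. Real trajectories are not independent, so the numbers should be read as an order-of-magnitude intuition only.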
Abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised,…
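The distinction the abstract draws between region-tolerant and point-precise actions can be sketched as two different validity checks. This is a minimal illustration, not the benchmark's actual scoring rule: the `Box` type, the function names, and the 2-pixel tolerance are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a GUI component, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def region_tolerant_hit(click: tuple[float, float], box: Box) -> bool:
    """Valid if the click lands anywhere inside the component."""
    return box.contains(*click)


def point_precise_hit(
    click: tuple[float, float],
    target: tuple[float, float],
    tol_px: float = 2.0,  # hypothetical tolerance, not taken from the paper
) -> bool:
    """Valid only if the click is within tol_px of the exact target point."""
    return math.dist(click, target) <= tol_px


button = Box(left=100, top=100, right=200, bottom=140)
target_point = (150.0, 120.0)
click = (160.0, 125.0)

print(region_tolerant_hit(click, button))      # inside the button
print(point_precise_hit(click, target_point))  # about 11 px from the target
```

The same click passes the region check and fails the point check, which is the sense in which a precise geometric construction is less forgiving than a button press.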
