AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu, Muhao Chen

TL;DR
AQuaUI is a training-free, adaptive quadtree-based method that reduces visual tokens in GUI screenshots, improving efficiency while maintaining high accuracy in GUI agent models.
Contribution
The paper introduces an inference-time token reduction technique that uses adaptive quadtrees to exploit spatial redundancy in GUI screenshots, with no retraining required.
Findings
Achieves up to 13.22% speedup in GUI agent inference.
Reduces visual tokens by 29.52% while retaining 99.06% of full-token performance.
Improves accuracy-efficiency trade-offs over prior methods.
Abstract
Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill this gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of…
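The quadtree idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar per-patch values, the max-minus-min homogeneity test, the `threshold` parameter, and mean-based merging are all assumptions chosen for illustration, since the summary does not state the paper's actual split criterion or merge rule.

```python
from typing import List, Tuple

# A leaf is (row, col, size, merged_value); one leaf stands for one kept token.
Leaf = Tuple[int, int, int, float]


def build_quadtree(grid: List[List[float]], threshold: float) -> List[Leaf]:
    """Split a square, power-of-two patch grid until each block is near-uniform.

    Assumption: a block counts as homogeneous when max - min <= threshold.
    """
    n = len(grid)
    assert n > 0 and n & (n - 1) == 0, "grid side must be a power of two"
    assert all(len(row) == n for row in grid), "grid must be square"
    leaves: List[Leaf] = []

    def visit(r: int, c: int, size: int) -> None:
        block = [grid[i][j] for i in range(r, r + size) for j in range(c, c + size)]
        if size == 1 or max(block) - min(block) <= threshold:
            # Assumption: merge the block into one token by averaging.
            leaves.append((r, c, size, sum(block) / len(block)))
            return
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                visit(r + dr, c + dc, half)

    visit(0, 0, n)
    return leaves


def token_reduction(grid: List[List[float]], threshold: float) -> float:
    """Fraction of patch tokens removed by keeping one token per leaf."""
    return 1.0 - len(build_quadtree(grid, threshold)) / (len(grid) ** 2)


# Toy 4x4 patch grid: a detailed top-left corner, uniform background elsewhere.
grid = [
    [0, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
leaves = build_quadtree(grid, threshold=1.0)
print(len(leaves), token_reduction(grid, threshold=1.0))
```

In this toy grid the detailed corner stays at full resolution (four single-patch leaves) while each uniform quadrant collapses to one leaf, so 7 of 16 tokens are kept. A real pipeline would operate on patch embeddings rather than scalars and would need to preserve positional information for the merged tokens.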
