AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Yuankai Li; Tinghui Zhu; Ha Min Son; Zhe Zhao; Xin Liu; Muhao Chen

arXiv:2605.19260·cs.AI·May 20, 2026

TL;DR

AQuaUI is a training-free, adaptive quadtree-based method that reduces visual tokens in GUI screenshots, improving efficiency while maintaining high accuracy in GUI agent models.

Contribution

It introduces a novel inference-time token reduction technique using adaptive quadtrees that exploits spatial redundancy in GUI images without retraining.

Findings

01. Achieves up to 13.22% speedup in GUI agent inference.
02. Reduces visual tokens by 29.52% while retaining 99.06% of full-token performance.
03. Improves accuracy-efficiency trade-offs over prior methods.

Abstract

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are added to the prompt at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may be visually homogeneous and carry little information, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill this gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that exploits the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of…
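The quadtree idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated and does not state the splitting rule, so the sketch assumes a variance threshold as the split criterion, the mean as the "representative merged token", and one scalar per patch in place of a real token embedding. The names `build_quadtree`, `merge_tokens`, `threshold`, and `min_size` are hypothetical, and the grid is assumed to be square with a power-of-two side.

```python
from statistics import pvariance


def block_values(grid, r0, c0, size):
    """Collect the values inside the square block with top-left (r0, c0)."""
    return [grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]


def build_quadtree(grid, r0, c0, size, threshold, min_size=1):
    """Return quadtree leaves as (row, col, size) tuples.

    ASSUMPTION: a block is split while its variance exceeds `threshold`;
    the paper's actual criterion is not given in the abstract.
    """
    if size <= min_size or pvariance(block_values(grid, r0, c0, size)) <= threshold:
        return [(r0, c0, size)]  # homogeneous enough: keep as one leaf
    half = size // 2
    leaves = []
    for dr in (0, half):
        for dc in (0, half):
            leaves += build_quadtree(grid, r0 + dr, c0 + dc, half, threshold, min_size)
    return leaves


def merge_tokens(grid, leaves):
    """Keep one merged value per leaf (ASSUMPTION: the mean of its patches)."""
    merged = []
    for r0, c0, size in leaves:
        values = block_values(grid, r0, c0, size)
        merged.append(sum(values) / len(values))
    return merged


# Toy 4x4 patch grid: a detailed top-left corner, flat everywhere else.
grid = [
    [0.0, 9.0, 1.0, 1.0],
    [9.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
leaves = build_quadtree(grid, 0, 0, 4, threshold=0.5)
merged = merge_tokens(grid, leaves)
print(len(grid) ** 2, "patches ->", len(merged), "tokens")
```

In this toy grid, only the detailed quadrant is subdivided down to single patches, while each flat quadrant collapses into one token, so the 16 patches become 7 tokens. The reduction adapts to the content of each screenshot instead of applying a fixed ratio.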
