Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression
Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang

TL;DR
This paper introduces ST-Lite, a training-free KV cache compression method for GUI agents that reduces memory use and decoding latency while maintaining task performance by exploiting the uniformly high sparsity of GUI attention patterns.
Contribution
ST-Lite is a novel, training-free framework tailored for GUI agents that explicitly models spatio-trajectory dependencies using a dual-branch scoring policy.
Findings
Achieves 2.45x decoding acceleration with a cache budget of only 10-20%.
Maintains or improves performance compared to full-cache baselines.
Addresses the unique uniform high-sparsity attention pattern in GUI scenarios.
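To make the notion of a "cache budget" concrete, here is a minimal sketch of budgeted KV cache eviction: each cached token gets an importance score, and only the top-scoring fraction is retained. This is a generic top-k eviction, not ST-Lite's dual-branch scoring policy, which this summary does not specify. The function names, the score source, and the tie-breaking rule are all assumptions made for illustration.

```python
def select_kept_positions(scores, budget_ratio):
    """Return sorted indices of cached tokens to keep under a budget.

    scores: one importance score per cached token (hypothetical input,
            e.g. accumulated attention mass); higher means more important.
    budget_ratio: fraction of the cache to retain, e.g. 0.1 to 0.2.
    """
    if not 0.0 < budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    # Round down, but always keep at least one token.
    keep = max(1, int(n * budget_ratio))
    # Rank by score, highest first; ties go to the more recent token.
    ranked = sorted(range(n), key=lambda i: (scores[i], i), reverse=True)
    # Restore original order so the kept entries stay position-ordered.
    return sorted(ranked[:keep])


def compress_cache(keys, values, scores, budget_ratio):
    """Drop key/value entries whose positions were not selected."""
    kept = select_kept_positions(scores, budget_ratio)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Usage: 10 cached tokens at a 20% budget keeps the 2 highest-scoring ones.
scores = [0.1, 0.9, 0.05, 0.4, 0.3, 0.02, 0.7, 0.01, 0.2, 0.6]
keys = [f"k{i}" for i in range(10)]
values = [f"v{i}" for i in range(10)]
kept_keys, kept_values, kept_idx = compress_cache(keys, values, scores, 0.2)
```

With a 10-20% budget the cache shrinks by a factor of five to ten, which is where the memory and decoding-latency savings come from; the quality of the scoring rule determines how much task performance survives.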
Abstract
Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating…
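The claim about layer-wise sparsity can be checked with a simple measurement. The sketch below assumes one possible definition, the fraction of attention weights that fall below a small threshold, and computes it per layer; the paper's exact sparsity metric is not given in this excerpt, so both the definition and the threshold value are assumptions.

```python
def attention_sparsity(attn_rows, threshold=0.01):
    """Fraction of attention weights strictly below `threshold`.

    attn_rows: list of rows, each a list of softmax attention weights
               for one query over all keys (assumed input layout).
    """
    total = 0
    small = 0
    for row in attn_rows:
        for weight in row:
            total += 1
            if weight < threshold:
                small += 1
    return small / total if total else 0.0


def per_layer_sparsity(layers, threshold=0.01):
    """Apply the sparsity measure to each layer's attention map."""
    return [attention_sparsity(rows, threshold) for rows in layers]


# Usage with toy data: one sharply peaked layer, one uniform layer.
peaked_layer = [[0.997, 0.001, 0.001, 0.001]]
uniform_layer = [[0.25, 0.25, 0.25, 0.25]]
profile = per_layer_sparsity([peaked_layer, uniform_layer])
```

Under this definition, a model whose sparsity profile is roughly flat and high across all layers matches the "uniform high-sparsity" pattern the abstract describes, whereas a profile that varies strongly by layer is the case that layer-adaptive compression methods are built for.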
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
