FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang; Kevin Qinghong Lin; Mike Zheng Shou; Hwee Tou Ng

arXiv:2601.03928 · cs.CV · January 8, 2026

TL;DR

FocusUI introduces an efficient UI grounding method that selects relevant visual tokens while maintaining positional information, significantly reducing computational costs with minimal accuracy loss.

Contribution

The paper proposes FocusUI, a framework that improves UI grounding efficiency through visual token selection and a new strategy for preserving positional continuity, and reports outperforming existing methods.

Findings

01. Achieves a 3.7% performance improvement on ScreenSpot-Pro.

02. Reduces visual token usage by 70% with only a 3.2% accuracy drop.

03. Delivers up to 1.44x faster inference and 17% lower GPU memory usage.

Abstract

Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an…
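The core idea in the abstract, keeping only instruction-relevant patches without losing track of where they sit on the screen, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_patches`, the `keep_ratio` parameter, and the precomputed per-patch `scores` are all hypothetical, and the paper's actual scoring and positional-continuity strategy are not described in the text above. The sketch simply keeps the top-scoring patches and returns them in raster order with their original grid coordinates, so that positions are not renumbered after pruning.

```python
from typing import List, Tuple


def select_patches(
    scores: List[float], grid_w: int, keep_ratio: float
) -> List[Tuple[int, int, int]]:
    """Keep the highest-scoring patches and retain their original positions.

    Hypothetical helper for illustration only.
    scores     : one relevance score per patch, in raster (row-major) order
    grid_w     : number of patches per row in the screenshot grid
    keep_ratio : fraction of patches to keep, in (0, 1]
    Returns (flat_index, row, col) for each kept patch, in raster order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if grid_w <= 0 or len(scores) % grid_w != 0:
        raise ValueError("scores must fill a grid with grid_w columns")

    # Number of patches to keep (at least one).
    k = max(1, round(len(scores) * keep_ratio))

    # Indices of the k highest-scoring patches.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Restore raster order and attach the ORIGINAL (row, col) of each patch,
    # rather than re-indexing the survivors as 0..k-1.
    return [(i, i // grid_w, i % grid_w) for i in sorted(top)]


# Example: a 3x4 patch grid, keeping 25% of the patches.
example_scores = [0.1, 0.9, 0.2, 0.05, 0.3, 0.8, 0.7, 0.0, 0.15, 0.25, 0.4, 0.6]
kept = select_patches(example_scores, grid_w=4, keep_ratio=0.25)
print(kept)
```

The design point is that the surviving patches carry their original coordinates instead of being compacted into a new, shorter sequence. As rough arithmetic only, keeping 30% of about 4700 visual tokens would leave roughly 1410 tokens; whether the reported 70% reduction was measured at that resolution is not stated here.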

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis