TL;DR
This paper reveals a two-stage process in GUI grounding with VLMs, emphasizing the importance of the prefill stage, and introduces Re-Prefill, a training-free method that improves candidate selection through an attention-guided second prefill.
Contribution
The paper identifies the prefill stage as critical in GUI grounding and proposes Re-Prefill, a novel inference technique that enhances candidate selection without additional training.
Findings
Re-Prefill improves performance across multiple VLMs and benchmarks.
Errors in candidate selection during prefill cannot be effectively corrected during decoding.
Re-Prefill yields gains of up to 4.3% on ScreenSpot-Pro.
Abstract
Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that…
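The asymmetry described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation and does not implement Re-Prefill: `Box`, `prefill_select`, `decode_refine`, and the candidate scores are all hypothetical stand-ins, and the "decoding" step is assumed to only adjust a point within the candidate chosen earlier. Under those assumptions, a wrong choice in the first stage produces a wrong final click regardless of how well the second stage refines it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical UI element with an axis-aligned bounding box."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def prefill_select(scores, boxes):
    """Stage 1 (toy stand-in for prefill): pick the highest-scoring candidate."""
    return max(boxes, key=lambda b: scores[b.name])


def decode_refine(candidate, rough_point):
    """Stage 2 (toy stand-in for decoding): refine the point, but only
    inside the candidate already chosen in stage 1 (assumption)."""
    x = min(max(rough_point[0], candidate.x0), candidate.x1)
    y = min(max(rough_point[1], candidate.y0), candidate.y1)
    return (x, y)


def ground(scores, boxes, rough_point):
    """Run both toy stages and return the final click point."""
    return decode_refine(prefill_select(scores, boxes), rough_point)


# Two made-up candidates; the instruction's true target is "save".
save = Box("save", 10, 10, 50, 30)
cancel = Box("cancel", 60, 10, 100, 30)
boxes = [save, cancel]
rough = (48, 20)  # a rough guess that happens to lie near "save"

good_scores = {"save": 0.9, "cancel": 0.1}  # stage 1 picks the target
bad_scores = {"save": 0.4, "cancel": 0.6}   # stage 1 picks the wrong element

hit = ground(good_scores, boxes, rough)
miss = ground(bad_scores, boxes, rough)

print("correct prefill lands on target:", save.contains(*hit))
print("wrong prefill lands on target:", save.contains(*miss))
```

In this toy setup the second stage can move the point only within the selected box, so improving it cannot repair a wrong selection; any fix has to act on the first stage, which is the motivation the abstract gives for intervening at prefill.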
