What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Jiaping Lin; Fei Shen; Junzhe Li; Ping Nie; Fei Yu; Ming Li; Haizhou Li

arXiv:2605.12549 · cs.CV · May 14, 2026

TL;DR

This paper shows that GUI grounding in VLMs proceeds in two stages, with prefill selecting the candidate UI elements and decoding refining the final coordinates, and introduces Re-Prefill, a training-free method that improves candidate selection through an attention-guided second prefill.

Contribution

The paper identifies the prefill stage as critical in GUI grounding and proposes Re-Prefill, a novel inference technique that enhances candidate selection without additional training.

Findings

1. Re-Prefill improves performance across multiple VLMs and benchmarks.

2. Errors in candidate selection during prefill cannot be effectively corrected during decoding.

3. Re-Prefill achieves gains of up to 4.3% on ScreenSpot-Pro.

Abstract

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that…
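For readers unfamiliar with the terminology, the toy sketch below illustrates the general prefill/decode split in autoregressive inference: prefill processes the whole prompt and builds a key/value cache, and each decoding step only appends to that cache. It is not the paper's method and does not implement Re-Prefill. The function names (`attend`, `prefill`, `decode_step`) are hypothetical, and the sketch assumes a single attention head with identity projections, so keys and values are simply the input vectors.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def prefill(prompt_vectors):
    """Process the whole prompt and build the key/value cache.

    A real model computes all prompt positions in parallel; the loop here
    is only for clarity. Assumption: identity projections (key = value = input).
    """
    cache = {"keys": [], "values": []}
    outputs = []
    for vec in prompt_vectors:
        cache["keys"].append(list(vec))
        cache["values"].append(list(vec))
        # Causal attention: each position sees itself and earlier positions.
        outputs.append(attend(vec, cache["keys"], cache["values"]))
    return cache, outputs

def decode_step(cache, new_vector):
    """One decoding step: append the new token's key/value, then attend."""
    cache["keys"].append(list(new_vector))
    cache["values"].append(list(new_vector))
    return attend(new_vector, cache["keys"], cache["values"])

# Hypothetical 2-dimensional "prompt" of three token vectors.
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache, prefill_outputs = prefill(prompt)
snapshot = [list(k) for k in cache["keys"]]

step_output = decode_step(cache, [0.5, 0.5])

# Decoding extended the cache but left the prompt entries untouched.
prompt_entries_unchanged = cache["keys"][:len(prompt)] == snapshot
```

In this toy setup the cached prompt entries are read but never rewritten during decoding. Whether and how that relates to the paper's claim about uncorrectable candidate-selection errors is for the paper itself to establish; the sketch only fixes the vocabulary.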

Code & Models

Repositories

linjiaping1/Re-Prefill (GitHub)
