ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar

TL;DR
ScreenParse introduces a large-scale, densely annotated dataset and a compact vision-language model for complete UI screen parsing, improving dense parsing accuracy and transfer to public UI benchmarks.
Contribution
The paper presents a large-scale, densely annotated UI dataset and a compact vision-language model trained on it, advancing complete screen parsing capabilities.
Findings
ScreenVLM outperforms larger models on dense parsing tasks.
Finetuning foundation models on ScreenParse improves their grounding performance.
ScreenParse enables better transfer to public UI benchmarks.
Abstract
Modern computer-use agents (CUAs) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering.…
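To make the idea of a screen as a "structured state" concrete, here is a minimal Python sketch of what one densely annotated screen could look like in code. The `UIElement` class, its field names, the label strings, and the `parse_to_lines` serialization are all illustrative assumptions; the abstract only states that each element carries a box, one of 55 class types, and text, and it does not specify the dataset's actual schema or output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    # Hypothetical schema: field names are assumptions, not the dataset's format.
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    label: str  # element class; the label strings used below are made up
    text: str   # visible text, empty string if the element has none


def parse_to_lines(elements: List[UIElement]) -> List[str]:
    """Serialize every element, ordered top to bottom, then left to right."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return [
        f"{e.label} [{e.box[0]:.0f},{e.box[1]:.0f},{e.box[2]:.0f},{e.box[3]:.0f}] {e.text!r}"
        for e in ordered
    ]


# A toy screen with three elements, all of which are annotated (dense supervision).
screen = [
    UIElement((200, 10, 280, 40), "button", "Sign in"),
    UIElement((10, 10, 120, 40), "link", "Home"),
    UIElement((10, 60, 400, 90), "text_input", ""),
]
lines = parse_to_lines(screen)
for line in lines:
    print(line)
```

Under this sketch, sparse grounding supervision would correspond to keeping only the one or two entries of `screen` that a task instruction refers to, whereas complete screen parsing keeps every visible element.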
