ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

A. Said Gurbuz; Sunghwan Hong; Ahmed Nassar; Marc Pollefeys; Peter Staar

arXiv:2602.14276 · cs.CV · May 4, 2026

TL;DR

ScreenParse introduces a large-scale dataset with dense annotations of all visible UI elements across 771K web screenshots, together with a compact vision-language model for complete screen parsing. The reported findings are stronger dense parsing and better transfer to public UI benchmarks.

Contribution

The paper presents a large-scale, densely annotated UI dataset and a compact vision-language model trained on it for complete screen parsing.

Findings

01. ScreenVLM outperforms larger models on dense parsing tasks.

02. Finetuning foundation models on ScreenParse improves their grounding performance.

03. ScreenParse enables better transfer to public UI benchmarks.

Abstract

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization. Moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering.…
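
To make "dense annotations" concrete, here is a minimal sketch of what a single parsed element might look like, plus the average annotation density implied by the abstract's totals. The `UIElement` class, its field names, the coordinate convention, and the example labels are all assumptions made for illustration; they are not the dataset's actual schema or class list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    # Hypothetical record: field names and layout are illustrative only,
    # not the real ScreenParse schema.
    box: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in pixels
    element_type: str  # the paper uses 55 classes; the labels below are made up
    text: str  # visible text, empty if the element has none


def elements_per_screen(total_elements: int, total_screens: int) -> float:
    """Average number of annotated elements per screenshot."""
    return total_elements / total_screens


# A toy "complete parse" of one screen: every visible element is listed,
# not just the one a task instruction happens to refer to.
screen = [
    UIElement((16.0, 12.0, 136.0, 44.0), "button", "Sign in"),
    UIElement((16.0, 60.0, 616.0, 92.0), "text_input", ""),
]

# Totals quoted in the abstract: 21M elements across 771K screenshots.
avg = elements_per_screen(21_000_000, 771_000)
print(round(avg, 1))  # 27.2
```

Under these rounded totals, each screenshot carries roughly 27 annotated elements on average, which is the contrast the abstract draws with sparse grounding datasets that label only a few task-relevant elements per screen.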

Code & Models

Repositories

https://saidgurbuz.github.io/screenparse (github)

Datasets

docling-project/screenparse (dataset, 1.5k downloads)
