ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

ZongHan Hsieh; Tzer-Jen Wei; ShengJing Yang

arXiv:2506.23491·cs.CV·July 21, 2025

TL;DR

ZonUI-3B is a compact vision-language model trained on diverse GUI data, achieving high accuracy in GUI grounding tasks on standard benchmarks with minimal computational resources.

Contribution

The paper introduces ZonUI-3B, a lightweight model that combines a cross-platform, multi-resolution training set, data curation with redundancy reduction, and a two-stage fine-tuning process for GUI grounding.

Findings

01

Achieves 84.9% accuracy on the ScreenSpot benchmark.

02

Outperforms prior models under 4B parameters and is comparable to significantly larger models.

03

Randomly sampling a smaller, less redundant subset of the training data achieves comparable performance.
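
As a rough illustration of how a GUI grounding score such as the ScreenSpot figure above can be computed, the sketch below counts a prediction as correct when the predicted click point falls inside the target element's bounding box. This is a minimal sketch, not the authors' evaluation code: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the pixel coordinate convention are all assumptions made for the example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) predicted click position
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


if __name__ == "__main__":
    # Hypothetical predictions against the same 10x10 target box.
    preds = [(5.0, 5.0), (20.0, 20.0)]
    targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(f"accuracy = {grounding_accuracy(preds, targets):.2f}")
```

Predictions and boxes must share one coordinate system (both in pixels, or both normalized) before they are compared; mixing the two is a common source of misleading scores when screenshots vary in resolution.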

Abstract

In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples drawn from diverse sources, including mobile, desktop, and web GUI screenshots, to address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, in which initial cross-platform training establishes robust GUI understanding and subsequent specialized fine-tuning on high-resolution data significantly enhances model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable…
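
The abstract does not spell out the curation procedure, so the following is only one plausible reading of point (iii): drop exact duplicates, then draw a fixed-size random subset. The record fields `image` and `instruction`, the function name `sample_subset`, and the choice of exact-match deduplication are assumptions for illustration, not details taken from the paper.

```python
import random
from typing import Dict, List


def sample_subset(examples: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Remove exact duplicates, then randomly sample at most k examples.

    Each example is assumed to be a dict with hypothetical keys
    "image" (screenshot path) and "instruction" (target description).
    """
    seen = set()
    unique = []
    for ex in examples:
        key = (ex["image"], ex["instruction"])
        if key not in seen:  # keep only the first copy of each pair
            seen.add(key)
            unique.append(ex)

    if k >= len(unique):
        return unique

    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return rng.sample(unique, k)


if __name__ == "__main__":
    data = [
        {"image": "a.png", "instruction": "click OK"},
        {"image": "a.png", "instruction": "click OK"},  # exact duplicate
        {"image": "b.png", "instruction": "open menu"},
        {"image": "c.png", "instruction": "close tab"},
    ]
    subset = sample_subset(data, k=2, seed=0)
    print(len(subset), "examples kept")
```

Fixing the seed keeps the sampled subset identical across runs, which matters when comparing a model trained on the subset against one trained on the full data.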

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques