ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding
ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang

TL;DR
ZonUI-3B is a compact vision-language model trained on diverse GUI data. It achieves high accuracy on standard GUI grounding benchmarks while remaining trainable on a single consumer-grade GPU (RTX 4090).
Contribution
The paper introduces ZonUI-3B, a lightweight model for GUI grounding that combines a cross-platform, multi-resolution training dataset, data curation with redundancy reduction, and a two-stage fine-tuning process.
Findings
Achieves 84.9% accuracy on the ScreenSpot benchmark.
Outperforms other models with fewer than 4B parameters and is comparable to significantly larger ones.
Randomly sampling a smaller, less redundant subset of the training data yields comparable performance.
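GUI grounding benchmarks such as ScreenSpot are commonly scored by click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The following is a minimal sketch of that metric, assuming pixel-coordinate boxes in (left, top, right, bottom) order. The function names and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom); assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


# Hypothetical example data: the first two predictions hit, the third misses.
example_points = [(50.0, 20.0), (300.0, 410.0), (5.0, 5.0)]
example_boxes = [(40, 10, 120, 30), (280, 400, 360, 430), (100, 100, 200, 200)]
example_accuracy = click_accuracy(example_points, example_boxes)
```

Treating box edges as inside the target is a design choice made here for simplicity; an actual evaluation script may handle boundaries or normalized coordinates differently.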
Abstract
In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources, including mobile, desktop, and web GUI screenshots, to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable…
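The redundancy-reduction idea in (iii) can be illustrated with a small sketch. The abstract does not specify the sampling procedure, so everything here is an assumption: the record fields (`image`, `instruction`), the per-screenshot cap, and the function name are hypothetical. The sketch shows one plausible reading, in which redundancy is reduced by randomly keeping only a few annotations per screenshot.

```python
import random
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, str]


def subsample_per_image(records: List[Record], max_per_image: int, seed: int = 0) -> List[Record]:
    """Randomly keep at most `max_per_image` annotations for each screenshot."""
    if max_per_image < 1:
        raise ValueError("max_per_image must be at least 1")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group annotations by the screenshot they belong to.
    by_image = defaultdict(list)
    for record in records:
        by_image[record["image"]].append(record)

    kept: List[Record] = []
    for image in sorted(by_image):  # sorted for a deterministic iteration order
        group = by_image[image]
        if len(group) > max_per_image:
            group = rng.sample(group, max_per_image)
        kept.extend(group)
    return kept


# Hypothetical data: one screenshot with five annotations, another with two.
records = [{"image": "a.png", "instruction": f"click item {i}"} for i in range(5)]
records += [{"image": "b.png", "instruction": f"open menu {i}"} for i in range(2)]
subset = subsample_per_image(records, max_per_image=3)
```

Capping per screenshot rather than sampling uniformly over all annotations keeps every screenshot represented, which is one way to shrink the dataset while preserving its diversity.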
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
