Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Yongshuo Zong; Qin Zhang; Dongsheng An; Zhihua Li; Xiang Xu; Linghan Xu; Zhuowen Tu; Yifan Xing; Onkar Dabeer

arXiv:2505.13788 · cs.CV · May 21, 2025

TL;DR

This paper introduces Ground-V, a scalable dataset and method for training vision-language models to perform pixel-level grounding of complex instructions, significantly improving accuracy on multiple benchmarks.

Contribution

We propose a knowledge distillation approach to automatically generate high-quality instruction-response pairs linked to pixel annotations, enabling effective training of grounding models without extensive human labeling.
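
The distillation workflow described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Annotation` and `GroundedPair` schemas, the prompt wording, the response templates, and the `teacher` callable (here replaced by a stub instead of a real pre-trained model) are all hypothetical and exist only to show how a teacher-written instruction might be linked to pixel-level annotations that already exist.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

# The five challenge types named in the abstract.
CHALLENGES = (
    "hallucinated references",
    "multi-object scenarios",
    "reasoning",
    "multi-granularity",
    "part-level references",
)


@dataclass
class Annotation:
    """An existing pixel-level annotation (hypothetical schema)."""
    mask_id: int
    category: str


@dataclass
class GroundedPair:
    """An instruction-response pair linked to mask ids (hypothetical schema)."""
    challenge: str
    instruction: str
    response: str
    mask_ids: List[int] = field(default_factory=list)


def build_prompt(challenge: str, annotations: List[Annotation]) -> str:
    """Describe the annotated objects to the teacher (assumed prompt wording)."""
    names = ", ".join(sorted({a.category for a in annotations}))
    return (f"Objects annotated in the image: {names}. "
            f"Write one instruction of type '{challenge}' about them.")


def make_pair(challenge: str, annotations: List[Annotation],
              targets: Set[str], teacher: Callable[[str], str]) -> GroundedPair:
    """Ask the teacher for an instruction and link it to matching masks."""
    if challenge not in CHALLENGES:
        raise ValueError(f"unknown challenge type: {challenge}")
    instruction = teacher(build_prompt(challenge, annotations))
    # Reuse existing annotations: no new pixel labels are drawn here.
    mask_ids = [a.mask_id for a in annotations if a.category in targets]
    if mask_ids:
        response = f"Segmented {len(mask_ids)} region(s)."
    else:
        # Target absent from the image: respond with a refusal and no mask.
        response = "The requested object is not present in the image."
    return GroundedPair(challenge, instruction, response, mask_ids)


def stub_teacher(prompt: str) -> str:
    """Stand-in for a real teacher model; returns a fixed instruction."""
    return "Segment the animals that could fetch a ball."


anns = [Annotation(0, "dog"), Annotation(1, "cat"), Annotation(2, "dog")]
pair = make_pair("reasoning", anns, {"dog"}, stub_teacher)
absent = make_pair("hallucinated references", anns, {"zebra"}, stub_teacher)
```

In this sketch, `pair` is linked to the two existing dog masks, while `absent` carries no mask ids and a refusal response, which is one plausible way to represent an instruction that refers to an object not in the image.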

Findings

01 Adding Ground-V to training improves the accuracy of the LISA model by 4.4%.

02 Adding Ground-V to training improves the accuracy of the PSALM model by 7.9% across six benchmarks.

03 Models trained with Ground-V set new state-of-the-art results on the RefCOCO/+/g datasets.

Abstract

This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V…

Taxonomy

Topics: Photonic and Optical Devices · Semiconductor Lasers and Optical Devices · Advanced Fiber Optic Sensors

Methods: Knowledge Distillation