EGM: Efficient Visual Grounding Language Models
Guanqi Zhan, Changye Li, Zhijian Liu, Yao Lu, Yi Wu, Song Han, Ligeng Zhu

TL;DR
EGM lets small visual grounding language models generate many mid-quality tokens so that they match the grounding performance of much larger models, with lower end-to-end inference latency.
Contribution
The paper proposes having small VLMs generate many mid-quality tokens in place of the few high-quality but expensive tokens produced by large VLMs, so that small models match large-model performance on visual grounding while remaining easier to deploy.
Findings
EGM-Qwen3-VL-8B achieves 91.4 IoU at 737 ms latency.
EGM reduces inference time by 5.9x compared to larger models.
The method improves small models' performance in both vanilla and amodal grounding.
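
The findings above are reported in terms of IoU (intersection over union) between a predicted box and the reference box. As a reminder of what that quantity measures, here is a minimal sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` corner format; the function name `box_iou` and the box format are illustrative choices, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Tuple

# Assumed box format (not from the paper): (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(a[0], b[0])
    iy1 = max(a[1], b[1])
    ix2 = min(a[2], b[2])
    iy2 = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` has an intersection of area 1 and a union of area 7, giving 1/7.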
Abstract
Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): generate many mid-quality tokens (from small models) to match the performance of large VLMs with few high-quality but expensive tokens. This method is deployment-friendly, and yields better end-to-end latency: On the RefCOCO benchmark,…
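
The trade-off described in the abstract (many cheap tokens from a small model versus few expensive tokens from a large one) can be reasoned about with a simple latency model. The sketch below is an assumption for illustration only: it treats end-to-end latency as a fixed prefill cost plus a constant per-token decode cost. The function names and the linear cost model are hypothetical and are not taken from the paper.

```python
def decode_latency_ms(prefill_ms: float, num_tokens: int, ms_per_token: float) -> float:
    """Estimated end-to-end latency under an assumed linear cost model.

    prefill_ms:   one-time cost of processing the image and prompt (assumed).
    num_tokens:   number of tokens the model generates.
    ms_per_token: average decode time per generated token (assumed constant).
    """
    return prefill_ms + num_tokens * ms_per_token


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return baseline_ms / candidate_ms
```

Under this model, a small model can emit several times more tokens than a large one and still finish sooner, provided its per-token cost is lower by a larger factor. The per-model values would have to be measured on the target hardware; none are supplied here.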
