Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Yufei Zhan; Shurong Zheng; Yousong Zhu; Hongyin Zhao; Fan Yang; Ming Tang; Jinqiao Wang

arXiv:2403.09333 · cs.CV · August 12, 2025 · 1 cites

TL;DR

Griffon v2 is a high-resolution multimodal model that enhances object perception and referring capabilities by scaling image resolution and integrating visual-language co-referring, surpassing existing models in complex scenarios.

Contribution

Introduces Griffon v2, a unified high-resolution multimodal model with a lightweight down-sampling projector and visual-language co-referring, enabling better perception and interaction.

Findings

1. Achieves state-of-the-art results on referring expression comprehension (REC) and phrase grounding.

2. Outperforms expert models in object detection and counting.

3. Effectively localizes objects with visual and textual references.

Abstract

Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this,…
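
To make the input-token constraint mentioned in the abstract concrete, the sketch below counts how many visual tokens a patch-based image encoder would emit for a square image, and how many remain after a strided convolutional down-sampling step. It is only an illustration: the abstract does not state the resolution, patch size, or projector design, so the function names and every number used here (1024-pixel input, 16-pixel patches, a 3x3 kernel with stride 2 and padding 1) are assumptions chosen for round arithmetic, not the authors' configuration.

```python
def vision_token_count(image_size: int, patch_size: int) -> int:
    """Tokens emitted by a patch-based encoder for a square image.

    Assumes non-overlapping square patches; any remainder pixels are dropped.
    """
    side = image_size // patch_size
    return side * side


def downsampled_token_count(
    image_size: int, patch_size: int, kernel: int, stride: int, padding: int
) -> int:
    """Tokens left after a strided 2D convolution over the patch grid.

    Uses the standard convolution output-size formula:
        out = floor((in + 2 * padding - kernel) / stride) + 1
    The convolution here stands in for a generic down-sampling projector
    (an assumption, not a detail taken from the abstract).
    """
    side = image_size // patch_size
    out_side = (side + 2 * padding - kernel) // stride + 1
    return out_side * out_side


if __name__ == "__main__":
    # Hypothetical settings, chosen only to keep the arithmetic simple.
    before = vision_token_count(1024, 16)
    after = downsampled_token_count(1024, 16, kernel=3, stride=2, padding=1)
    print(before, after)  # 4096 1024
```

Under these assumed settings, a stride-2 step cuts the visual sequence to a quarter of its length while every patch still contributes to some output token, which is the general trade-off the abstract alludes to when it says the design preserves complete context.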

Code & Models

Repositories

jefferyzhan/griffon (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling