ParGo: Bridging Vision-Language with Partial and Global Views
An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Can Huang, Jingqun Tang, Wei-Shi Zheng

TL;DR
ParGo introduces a novel Partial-Global projector that effectively bridges vision and language modalities in multimodal models by integrating global and partial views, improving alignment and detail perception.
Contribution
The paper proposes ParGo, a new Partial-Global projector for multimodal large language models, and introduces ParGoCap-1M-PT, a detail-captioned image-text dataset of 1 million images collected to train it.
Findings
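ParGo improves on the conventional Q-Former projector by 259.96 points on the MME benchmark.
By integrating partial and global views, ParGo alleviates the overemphasis on prominent image regions and improves detail perception.
Experiments on several MLLM benchmarks support ParGo's effectiveness in aligning the vision and language modalities.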
Abstract
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between the separately pre-trained vision encoders and the LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of our ParGo, highlighting its superiority in aligning vision and language modalities. Compared to the conventional Q-Former projector, our ParGo achieves an improvement of 259.96 in MME…
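The abstract does not spell out how the partial and global views are combined. One plausible reading, sketched below under stated assumptions and not taken from the authors' implementation, is masked cross-attention: a few "partial" query tokens may each attend only to their own slice of image patches, while "global" query tokens attend to every patch. The function names, the contiguous slicing of patches, and the absence of learned projections are all hypothetical choices made for illustration.

```python
import math
import random


def partial_global_mask(num_partial, num_global, num_patches):
    """Return mask[q][p]: True if query token q may attend to image patch p.

    Assumption (not from the paper): each partial query sees one contiguous,
    near-equal slice of patches. Requires 1 <= num_partial <= num_patches.
    """
    bounds = [i * num_patches // num_partial for i in range(num_partial + 1)]
    mask = [[bounds[q] <= p < bounds[q + 1] for p in range(num_patches)]
            for q in range(num_partial)]
    # Global queries are allowed to see every patch.
    mask += [[True] * num_patches for _ in range(num_global)]
    return mask


def masked_cross_attention(queries, patches, mask):
    """Single-head cross-attention with no learned projections (illustrative)."""
    dim = len(patches[0])
    outputs, all_weights = [], []
    for query, allowed in zip(queries, mask):
        # Disallowed patches get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(query, patch)) / math.sqrt(dim)
            if ok else -math.inf
            for patch, ok in zip(patches, allowed)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * patch[j] for w, patch in zip(weights, patches))
                        for j in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights


# Toy setup: 16 patch features of width 8, with 4 partial and 2 global queries.
random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
mask = partial_global_mask(num_partial=4, num_global=2, num_patches=16)
tokens, weights = masked_cross_attention(queries, patches, mask)
```

In this toy setup the first partial token summarizes only patches 0 to 3, so a salient region elsewhere in the image cannot dominate it, while the two global tokens still aggregate over all 16 patches. That is the intuition behind "alleviating the overemphasis on prominent regions", though the real projector may partition and mix views differently.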
