Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

Jiachen Jiang; Jinxin Zhou; Bo Peng; Xia Ning; Zhihui Zhu

arXiv:2505.17316 · cs.CV · May 26, 2025

TL;DR

This paper investigates how the projector in a multimodal LLM aligns vision embeddings with the language model's word embeddings. It shows that the projector compresses visual information, and it proposes patch-aligned training to improve visual understanding and downstream task performance.

Contribution

The paper introduces patch-aligned training, which strengthens patch-level alignment and compression in multimodal models and leads to better captioning and downstream task performance.

Findings

01 Improved patch-level alignment with patch-aligned training

02 Enhanced caption quality and visual understanding

03 Significant performance gains on grounding and QA tasks

Abstract

Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the…
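To make the idea of patch-level alignment concrete, here is a minimal sketch of how one might probe it: project each vision patch embedding into the word-embedding space and look up the most similar word by cosine similarity. This is an illustration only, not the paper's metric or training procedure. The function names, the single linear layer standing in for the projector, and the toy two-word vocabulary are all assumptions made for the example.

```python
import math


def linear_projector(patch, weight, bias):
    """Map one vision patch embedding into the word-embedding space.

    Assumption: a single linear layer stands in for the projector;
    real projectors are often small MLPs.
    """
    return [sum(w * x for w, x in zip(row, patch)) + b
            for row, b in zip(weight, bias)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_word(projected_patch, word_embeddings):
    """Return the (word, similarity) pair with the highest cosine similarity."""
    scored = ((word, cosine(projected_patch, emb))
              for word, emb in word_embeddings.items())
    return max(scored, key=lambda pair: pair[1])


# Toy data (hypothetical): 3-dim vision embeddings, 2-dim word embeddings.
word_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]  # projects 3 dims down to 2, dropping the third
bias = [0.0, 0.0]
patches = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]

labels = [nearest_word(linear_projector(p, weight, bias), word_embeddings)[0]
          for p in patches]
print(labels)
```

In this toy setup the projector discards the third vision dimension, a crude stand-in for the compression the abstract describes, while the nearest-word lookup gives a rough per-patch reading of how well projected patches line up with word embeddings.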

Videos

Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models · SlidesLive