TL;DR
This paper investigates how vision embeddings are aligned with language models in multimodal systems. It shows that the projector compresses visual information and aligns it with word embeddings, and it proposes patch-aligned training to improve visual understanding and downstream task performance.
Contribution
The paper introduces patch-aligned training, which strengthens patch-level alignment and compression in multimodal models and, in turn, improves captioning and downstream task performance.
Findings
Improved patch-level alignment with patch-aligned training
Enhanced caption quality and visual understanding
Significant performance gains on grounding and QA tasks
Abstract
Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment -- the…
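To make the projector idea concrete, here is a minimal toy sketch, not the paper's method. It assumes a purely linear projector and uses cosine similarity to find the nearest word embedding for a projected patch; the excerpt above does not specify the projector architecture or the alignment measure. All names, dimensions, and values (`project`, `nearest_word`, `WEIGHT`, `WORDS`) are hypothetical and chosen only for illustration.

```python
import math

def project(patch, weight, bias):
    """Linear projector: out[j] = sum_i patch[i] * weight[i][j] + bias[j]."""
    return [
        sum(p * row[j] for p, row in zip(patch, weight)) + bias[j]
        for j in range(len(bias))
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def nearest_word(projected_patch, word_embeddings):
    """Return the word whose embedding is most similar to the projected patch."""
    return max(
        word_embeddings,
        key=lambda w: cosine(projected_patch, word_embeddings[w]),
    )

# Hypothetical toy setup: 4-dim vision embeddings, 3-dim LLM word embeddings.
# This weight matrix simply drops the 4th vision dimension, a crude stand-in
# for the "compression" the abstract attributes to the projector.
WEIGHT = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
BIAS = [0.0, 0.0, 0.0]
WORDS = {
    "cat": [1.0, 0.0, 0.0],
    "sky": [0.0, 1.0, 0.0],
    "grass": [0.0, 0.0, 1.0],
}

patch = [0.9, 0.1, 0.0, 0.3]  # one made-up vision patch embedding
projected = project(patch, WEIGHT, BIAS)
print(nearest_word(projected, WORDS))
```

In this toy setup, patch-level alignment would mean that each projected patch lands near the word embedding describing its content. A real MLLM would use learned weights, far higher dimensions, and typically a nonlinear projector.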