BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu

TL;DR
BASIC improves visual alignment in multimodal large language models by supervising the visual embeddings directly, using the refined embeddings inside the LLM as targets. The authors report performance gains without extra annotations or supervisory models.
Contribution
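The paper proposes BASIC, a method that uses the refined visual embeddings from the LLM's shallow layers as direct supervision for the vision projector's initial visual embeddings, with the aim of improving visual alignment and understanding in MLLMs.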
Findings
Improves MLLM performance across multiple benchmarks.
Enhances semantic matching of visual embeddings.
Does not require additional supervisory models or annotations.
Abstract
Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM's shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings.…
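The core idea in the abstract, using refined embeddings from inside the LLM as targets for the projector's initial embeddings, can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the excerpt above does not specify the loss, so cosine distance is assumed here, and the names `cosine_similarity` and `alignment_loss` are hypothetical. In a real training setup the refined embeddings would be treated as fixed targets (no gradient flowing into them).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def alignment_loss(initial_embeddings, refined_embeddings):
    """Mean (1 - cosine similarity) over visual tokens.

    initial_embeddings: projector outputs, one vector per visual token.
    refined_embeddings: hidden states of the same tokens taken from a
        shallow LLM layer, used only as targets (assumption: cosine
        distance is the training signal; the source does not say).
    """
    if len(initial_embeddings) != len(refined_embeddings):
        raise ValueError("token counts must match")
    per_token = [
        1.0 - cosine_similarity(e, r)
        for e, r in zip(initial_embeddings, refined_embeddings)
    ]
    return sum(per_token) / len(per_token)


# Toy example: the first token already points the same way as its target,
# the second is orthogonal to its target.
initial = [[1.0, 0.0], [0.0, 2.0]]
refined = [[1.0, 0.0], [1.0, 0.0]]
loss = alignment_loss(initial, refined)
```

In this toy case the aligned token contributes roughly 0 and the orthogonal token contributes 1, so the mean is about 0.5. Minimizing such a term alongside the usual auto-regressive text loss would push the projector's outputs toward the directions the LLM arrives at after its shallow layers.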
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
