Semantic Alignment for Multimodal Large Language Models
Tao Wu, Mengze Li, Jingyuan Chen, Wei Ji, Wang Lin, Jinyang Gao, Kun Kuang, Zhou Zhao, Fei Wu

TL;DR
This paper introduces SAM, a semantic alignment method for Multi-modal Large Language Models (MLLMs) that improves cross-image semantic coherence by applying bidirectional semantic guidance during visual token extraction, which significantly improves performance on multi-image understanding tasks.
Contribution
The paper proposes a novel semantic alignment approach that incorporates bidirectional semantic guidance to better preserve linking information among images in MLLMs.
Findings
SAM outperforms state-of-the-art methods by +37% on group captioning and +22% on storytelling tasks.
The proposed dataset MmLINK contains 69K diverse multi-modal instruction samples.
Extensive experiments demonstrate the effectiveness of SAM in multi-image understanding.
Abstract
Research on Multi-modal Large Language Models (MLLMs) towards the multi-image cross-modal instruction has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step process in their pipelines: first, extracting visual tokens independently for each input image, and then aligning these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, the independent extraction of visual tokens for each image may result in different semantics being prioritized for different images in the first step, leading to a lack of preservation of linking information among images for subsequent LLM analysis. This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual…
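The contrast drawn in the abstract, between extracting visual tokens independently per image and letting the other images guide that extraction, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the query-based cross-attention, the mean-pooled "summary" of each image, and the names `extract_tokens`, `independent_extraction`, `guided_extraction`, and `strength` are all assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_tokens(queries, patches):
    """Cross-attention: queries (Q, D) attend over patches (P, D) -> tokens (Q, D)."""
    scores = queries @ patches.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ patches

def independent_extraction(queries, images):
    """Step one of the conventional pipeline: each image is processed alone."""
    return [extract_tokens(queries, patches) for patches in images]

def guided_extraction(queries, images, strength=1.0):
    """Hypothetical guidance: shift the queries for image i by a context
    vector pooled from the tokens of all *other* images, so every image
    both gives and receives guidance."""
    if len(images) < 2:
        # No other image to provide context; fall back to independent extraction.
        return independent_extraction(queries, images)
    # First pass: one mean-pooled summary vector (D,) per image.
    summaries = [extract_tokens(queries, p).mean(axis=0) for p in images]
    tokens = []
    for i, patches in enumerate(images):
        others = [s for j, s in enumerate(summaries) if j != i]
        context = np.mean(others, axis=0)            # (D,)
        guided_queries = queries + strength * context
        tokens.append(extract_tokens(guided_queries, patches))
    return tokens

# Toy data: 3 images, 9 patch features each, 4 queries, feature width 16.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))
images = [rng.normal(size=(9, 16)) for _ in range(3)]

ind = independent_extraction(queries, images)
gui = guided_extraction(queries, images)
```

With `strength=0.0` the guided variant reduces to independent extraction, which makes the effect of the cross-image context easy to isolate in this toy setting.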
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · ALIGN · Segment Anything Model
