Semantic Alignment for Multimodal Large Language Models

Tao Wu; Mengze Li; Jingyuan Chen; Wei Ji; Wang Lin; Jinyang Gao; Kun Kuang; Zhou Zhao; Fei Wu

arXiv:2408.12867·cs.CV·August 26, 2024

TL;DR

This paper introduces SAM, a semantic alignment method for Multi-modal Large Language Models, which improves cross-image semantic coherence by bidirectional guidance during visual token extraction, significantly enhancing multi-image understanding tasks.

Contribution

The paper proposes a novel semantic alignment approach that incorporates bidirectional semantic guidance to better preserve linking information among images in MLLMs.

Findings

01 SAM outperforms state-of-the-art methods by +37% on group captioning and +22% on storytelling tasks.

02 The proposed dataset MmLINK contains 69K diverse multi-modal instruction samples.

03 Extensive experiments demonstrate the effectiveness of SAM in multi-image understanding.

Abstract

Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instructions has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, they extract visual tokens independently for each input image; then, they align these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, extracting visual tokens independently for each image may cause different semantics to be prioritized for different images in the first step, so that the linking information among images is not preserved for subsequent LLM analysis. This issue becomes more serious in scenarios where significant variations exist among the images (e.g., visual…
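The contrast the abstract draws between independent token extraction and extraction guided by the other images can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the attention-pooling extractor, the function names (`extract_tokens`, `extract_independent`, `extract_with_guidance`), the `strength` parameter, and the way context from other images is added to the queries are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_tokens(patches, queries):
    """Attention pooling: each query attends over one image's patch features.

    patches: (num_patches, dim), queries: (num_queries, dim)
    returns: (num_queries, dim) visual tokens
    """
    scores = queries @ patches.T / np.sqrt(patches.shape[1])
    return softmax(scores, axis=-1) @ patches

def extract_independent(images, queries):
    # Step one of the pipeline described in the abstract:
    # every image is processed with no knowledge of the others.
    return [extract_tokens(p, queries) for p in images]

def extract_with_guidance(images, queries, strength=1.0):
    """Hypothetical guided variant (an assumption, not the paper's module).

    Each image's queries are shifted by the mean token of the other images,
    so attention favours patches related to what the other images contain.
    Guidance flows in both directions because every image is both a source
    and a recipient of context.
    """
    first_pass = extract_independent(images, queries)
    if len(images) < 2:
        return first_pass  # nothing to guide with
    guided = []
    for i, patches in enumerate(images):
        others = [t.mean(axis=0) for j, t in enumerate(first_pass) if j != i]
        context = np.mean(others, axis=0)  # (dim,) summary of the other images
        guided.append(extract_tokens(patches, queries + strength * context))
    return guided

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.normal(size=(16, 8)), rng.normal(size=(20, 8))]  # toy patch features
    queries = rng.normal(size=(4, 8))
    independent = extract_independent(images, queries)
    guided = extract_with_guidance(images, queries)
    print(independent[0].shape, guided[0].shape)
```

With `strength=0.0` the guided variant reduces to independent extraction, which makes the effect of the cross-image context easy to isolate in this toy setting.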

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · ALIGN · Segment Anything Model