AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Fei Zhao; Taotian Pang; Chunhui Li; Zhen Wu; Junjie Guo; Shangyu Xing; Xinyu Dai

arXiv:2405.14129 · cs.CL · November 26, 2024 · 2 cites



Open Access 4 Models

TL;DR

AlignGPT introduces a novel approach to multimodal large language models by adaptively learning and combining different levels of image-text alignment, improving performance across multiple benchmarks.

Contribution

The paper presents a new training paradigm that models varying degrees of alignment during pre-training and adapts to task-specific alignment needs during instruction tuning.

Findings

01 Achieves competitive results on 12 benchmarks.

02 Models different alignment levels during pre-training.

03 Adapts alignment representations for diverse tasks.

Abstract

Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, these models have shortcomings in how they model alignment capabilities. First, during the pre-training phase, the model usually assumes that all image-text pairs are uniformly aligned, but in fact the degree of alignment varies from one image-text pair to another. Second, the instructions currently used for fine-tuning cover a variety of tasks, and different tasks usually require different levels of alignment capability, but previous MLLMs overlook these differentiated alignment needs. To…
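The abstract's two observations (image-text pairs differ in how well they are aligned, and tasks differ in how much alignment they need) can be illustrated with a small sketch. This is not the paper's implementation: the page does not say how alignment is scored or combined, so the per-pair similarity scores, the equal-frequency bucketing, the softmax gate, and the function names `assign_alignment_levels` and `combine_level_embeddings` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def assign_alignment_levels(scores: Sequence[float], num_levels: int) -> List[int]:
    """Bucket image-text pairs into discrete alignment levels by score rank.

    Assumption: `scores` holds one similarity score per image-text pair,
    produced by some external scorer. Level 0 is the weakest alignment.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be >= 1")
    # Sort pair indices from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        # Equal-frequency buckets: each level gets roughly the same pair count.
        levels[idx] = min(rank * num_levels // len(scores), num_levels - 1)
    return levels


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_level_embeddings(
    level_embeddings: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> List[float]:
    """Mix per-level embedding vectors with softmax gate weights.

    Assumption: a task-dependent gate emits one logit per alignment level;
    the result is the weighted sum of the level embeddings.
    """
    if len(level_embeddings) != len(gate_logits):
        raise ValueError("need exactly one gate logit per alignment level")
    weights = softmax(gate_logits)
    dim = len(level_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, level_embeddings))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Four hypothetical pairs split into two alignment levels.
    print(assign_alignment_levels([0.9, 0.1, 0.5, 0.3], num_levels=2))
    # Equal gate logits give an even mix of the two level embeddings.
    print(combine_level_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

In this sketch, pre-training would attach the level assigned to each pair as a conditioning signal, while instruction tuning would learn the gate logits per task. Both steps are a reading of the summary above, not details confirmed by it.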



Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems