AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

Jun Gao; Qian Qiao; Ziqiang Cao; Zili Wang; Wenjie Li

arXiv:2406.07588 · cs.MM · July 2, 2024

TL;DR

AIM introduces a lightweight framework that enables multi-modal large language models to perform efficient in-context learning by converting visual information into fused virtual tokens, reducing reliance on multi-modal training data.

Contribution

The paper proposes a novel method to incorporate multi-modal demonstrations into LLMs by transforming image-text pairs into fused tokens, enhancing multi-modal ICL without retraining the core model.

Findings

01. AIM enables multi-modal ICL with minimal additional training.

02. The framework improves performance on multi-modal tasks without modifying the LLM.

03. AIM is compatible with any frozen MLLM and is trained on public multi-modal data.

Abstract

In-context learning (ICL) lets Large Language Models (LLMs) exhibit emergent abilities on downstream tasks without updating billions of parameters. However, for multi-modal Large Language Models (MLLMs), two problems hinder the application of multi-modal ICL: (1) Most mainstream MLLMs are trained only on single-image datasets, making them unable to read multi-modal demonstrations. (2) As the number of demonstrations grows, thousands of visual tokens strain hardware and degrade ICL performance. In preliminary explorations, we found that the inner LLM tends to focus more on the linguistic modality within multi-modal demonstrations when generating responses. Therefore, we propose a general and lightweight framework, AIM, to tackle these problems by Aggregating Image information of Multimodal demonstrations to the…
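
The second problem in the abstract is largely one of arithmetic, and a rough sketch makes the scale concrete. The snippet below counts prompt tokens for a few-shot multi-modal prompt with and without per-demonstration visual tokens. All numbers are illustrative assumptions, not figures from the paper: 576 visual tokens per image (the patch count of a ViT-L/14 encoder at 336×336 resolution), 20 text tokens per demonstration, and 16 demonstrations. The `Demo` class and `context_length` function are hypothetical names, and the fused case assumes that a fused demonstration occupies only its text positions while the query image is kept as-is.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One image-text demonstration (all sizes are hypothetical)."""
    text_tokens: int    # tokens in the demonstration's text
    visual_tokens: int  # tokens the vision encoder emits for its image


def context_length(demos, query_text_tokens, query_visual_tokens, fuse_images):
    """Return the prompt length in tokens.

    fuse_images=False: every demonstration keeps its visual tokens.
    fuse_images=True: demonstration images are assumed to be folded into
    the text positions, so each demonstration costs only its text tokens.
    The query image is counted in both cases.
    """
    total = query_text_tokens + query_visual_tokens
    for demo in demos:
        total += demo.text_tokens
        if not fuse_images:
            total += demo.visual_tokens
    return total


# Assumed sizes: 16 demonstrations, 20 text tokens and 576 visual tokens each.
demos = [Demo(text_tokens=20, visual_tokens=576) for _ in range(16)]
baseline = context_length(demos, 20, 576, fuse_images=False)
fused = context_length(demos, 20, 576, fuse_images=True)
print(baseline, fused)  # prints: 10132 916
```

Under these assumptions the unfused prompt is roughly eleven times longer than the fused one, which illustrates why the number of visual tokens, rather than the text, dominates the hardware cost as demonstrations are added.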

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
