AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
Jun Gao, Qian Qiao, Ziqiang Cao, Zili Wang, Wenjie Li

TL;DR
AIM introduces a lightweight framework that enables multi-modal large language models to perform efficient in-context learning by converting visual information into fused virtual tokens, reducing reliance on multi-modal training data.
Contribution
The paper proposes a novel method to incorporate multi-modal demonstrations into LLMs by transforming image-text pairs into fused tokens, enhancing multi-modal ICL without retraining the core model.
Findings
AIM effectively enables multi-modal ICL with minimal additional training.
The framework improves performance on multi-modal tasks without modifying the LLM.
AIM is compatible with any frozen MLLM and is trained on public multi-modal data.
Abstract
In-context learning (ICL) enables Large Language Models (LLMs) to exhibit emergent abilities on downstream tasks without updating billions of parameters. However, for multi-modal Large Language Models (MLLMs), two problems hinder the application of multi-modal ICL: (1) Most mainstream MLLMs are trained only on single-image datasets, making them unable to read multi-modal demonstrations. (2) As the number of demonstrations grows, thousands of visual tokens strain hardware and degrade ICL performance. During preliminary explorations, we discovered that the inner LLM tends to focus more on the linguistic modality within multi-modal demonstrations when generating responses. Therefore, we propose a general and lightweight framework, AIM, to tackle these problems through Aggregating Image information of Multimodal demonstrations to the…
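To make the second problem concrete, the following is a rough back-of-the-envelope sketch of how prompt length grows with the number of demonstrations. It is not taken from the paper: the function name `prompt_tokens`, the per-image and per-text token counts, and the simplification that a fused demonstration occupies no more positions than its text are all illustrative assumptions.

```python
def prompt_tokens(num_demos, text_tokens_per_demo, visual_tokens_per_image,
                  query_text_tokens, keep_demo_visual_tokens):
    """Count the tokens an LLM would see for a k-shot multi-modal prompt.

    Hypothetical helper for illustration only. When keep_demo_visual_tokens
    is False, each demonstration is assumed (not confirmed by the source)
    to cost only its text positions, as if image information had been
    folded into them.
    """
    per_demo = text_tokens_per_demo
    if keep_demo_visual_tokens:
        per_demo += visual_tokens_per_image
    # The query keeps its own image and text in both settings.
    query = visual_tokens_per_image + query_text_tokens
    return num_demos * per_demo + query


# Assumed numbers: 8 demonstrations, 20 text tokens each,
# 576 visual tokens per image, 20 text tokens in the query.
raw = prompt_tokens(8, 20, 576, 20, keep_demo_visual_tokens=True)
fused = prompt_tokens(8, 20, 576, 20, keep_demo_visual_tokens=False)
print(raw, fused)
```

Under these assumed numbers the raw prompt is several times longer than the fused one, which is the kind of gap the abstract points to when it says visual tokens challenge hardware as demonstrations increase.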
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Focus
