Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?

Shuo Chen; Zhen Han; Bailan He; Jianzhe Liu; Mark Buckley; Yao Qin; Philip Torr; Volker Tresp; Jindong Gu

arXiv:2311.18021·cs.CV·December 10, 2024·5 cites

TL;DR

This paper investigates whether multimodal large language models truly perform multimodal in-context learning or whether their success is driven mainly by textual content, and proposes a new demo selection method to improve performance.

Contribution

The study reveals that multimodal ICL is primarily influenced by textual information and introduces MMICES, a demo selection strategy considering both visual and textual modalities.

Findings

01. Multimodal ICL is mainly driven by textual content.

02. Visual content aids in selecting better demos.

03. Proposed MMICES improves ICL performance.

Abstract

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, its research on MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning