True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen; Jianzhe Liu; Zhen Han; Yan Xia; Daniel Cremers; Philip Torr; Volker Tresp; Jindong Gu

arXiv:2507.15807·cs.CV·August 7, 2025

TL;DR

This paper introduces DARA, a fine-tuning method, and TrueMICL, a dedicated dataset, to improve multimodal in-context learning in multimodal large language models (MLLMs) by encouraging models to attend to the visual context of the demonstrations.

Contribution

The paper proposes DARA, a fine-tuning strategy, and introduces TrueMICL, a dataset designed to evaluate true multimodal in-context learning capabilities.

Findings

01. DARA significantly improves visual attention in MICL models.

02. TrueMICL effectively evaluates genuine multimodal understanding.

03. Models trained with DARA outperform baselines on TrueMICL.

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To…
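To make the MICL setup described in the abstract concrete, the following is a minimal sketch of how a few-shot multimodal prompt might be assembled from (image, question, answer) demonstrations plus a query. It is illustrative only and is not the paper's DARA method or the TrueMICL data format. The `Demo` class, the `build_micl_prompt` function, the `<image>` placeholder, and the file names are all hypothetical; real MLLMs define their own image tokens and prompt templates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image is inserted;
# actual MLLMs each use their own special token or template.
IMAGE_TOKEN = "<image>"


@dataclass
class Demo:
    """One in-context demonstration: an image, a question, and its answer."""
    image_path: str
    question: str
    answer: str


def build_micl_prompt(
    demos: List[Demo], query_image: str, query_question: str
) -> Tuple[str, List[str]]:
    """Interleave demonstrations and the query into one prompt.

    Returns the prompt text and the image paths in the order in which
    their placeholders appear in the text.
    """
    parts = []
    images = []
    for d in demos:
        parts.append(f"{IMAGE_TOKEN}\nQuestion: {d.question}\nAnswer: {d.answer}")
        images.append(d.image_path)
    # The query repeats the demonstration layout but leaves the answer open.
    parts.append(f"{IMAGE_TOKEN}\nQuestion: {query_question}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images


# Illustrative usage with made-up file names and answers.
demos = [
    Demo("demo1.png", "What is the result?", "7"),
    Demo("demo2.png", "What is the result?", "12"),
]
prompt, images = build_micl_prompt(demos, "query.png", "What is the result?")
print(prompt)
```

In this toy prompt the question text is identical in every demonstration, so the text alone does not determine the answer. A model that ignores the demonstration images can only copy the answer format, which is the kind of text imitation the abstract describes.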

Code & Models

Datasets

ShuoChen99/TrueMICL
dataset · 64 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis