Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

Yu Wang; Sharon Li

arXiv:2604.13403·cs.CV·April 16, 2026

TL;DR

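This paper systematically analyzes in-context learning in multimodal large language models, traces its few-shot shortfall to weak task mapping transfer and missing reasoning-level alignment between visual and textual representations, and proposes an inference-stage enhancement that improves task mapping transfer.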

Contribution

It decomposes multimodal ICL into task mapping construction and task mapping transfer, identifies the bottlenecks in each stage, and introduces a simple inference-stage method to enhance task mapping transfer.

Findings

01

Multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations.

02

Models lack reasoning-level alignment between visual and textual representations.

03

A simple inference-stage enhancement improves task mapping transfer.

Abstract

In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual…
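
To make the zero-shot versus few-shot comparison concrete, here is a minimal sketch of how the same task formulation could be rendered as a text-only or a multimodal prompt. It is an illustration only, not the paper's code: the `Example` class, the `build_prompt` helper, the `<image>` placeholder token, and the `Input:`/`Output:` template are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

IMAGE_TOKEN = "<image>"  # hypothetical placeholder; real models use their own token


@dataclass
class Example:
    """One sample: a text rendering, an optional label, and an optional image path."""
    text: str
    label: Optional[str] = None
    image_path: Optional[str] = None


def build_prompt(
    demos: Sequence[Example], query: Example, multimodal: bool = False
) -> Tuple[str, List[Optional[str]]]:
    """Render demonstrations followed by the query in one shared template.

    An empty `demos` gives a zero-shot prompt; a non-empty one gives few-shot.
    In the multimodal variant each input is an image placeholder, and the image
    paths are returned in the order the placeholders appear.
    """
    lines: List[str] = []
    images: List[Optional[str]] = []
    samples = list(demos) + [query]
    for i, ex in enumerate(samples):
        if multimodal:
            lines.append(f"Input: {IMAGE_TOKEN}")
            images.append(ex.image_path)
        else:
            lines.append(f"Input: {ex.text}")
        # The last sample is the query: its output is left open for the model.
        is_query = i == len(samples) - 1
        lines.append("Output:" if is_query else f"Output: {ex.label}")
    return "\n".join(lines), images


demos = [
    Example("a red square", "red", "img/0.png"),
    Example("a blue circle", "blue", "img/1.png"),
]
query = Example("a green triangle", None, "img/2.png")

zero_shot_text, _ = build_prompt([], query)
few_shot_mm, few_shot_images = build_prompt(demos, query, multimodal=True)
```

Keeping the template identical across the two variants means the only thing that changes is the modality of the input, which is the kind of controlled comparison the abstract describes.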

Code & Models

Repositories

deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI (GitHub)