Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in Modern Transformers

Yiran Huang; Karsten Roth; Quentin Bouniot; Wenjia Xu; Zeynep Akata

arXiv:2601.20796 · cs.CL · January 29, 2026

Open Access

TL;DR

This paper investigates how multimodal transformers learn to associate information across modalities through in-context learning, revealing asymmetries and circuit dynamics that underpin this ability.

Contribution

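The paper introduces a controlled experimental framework, built on small transformers trained on synthetic classification tasks, to analyze multimodal ICL. It uncovers a learning asymmetry between modalities and circuit mechanisms that extend unimodal principles to multimodal settings.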

Findings

01 Rotary Position Embeddings increase the data complexity threshold for ICL

02 Low data complexity in the secondary modality suffices for multimodal ICL

03 Multimodal training refines circuits that copy labels across modalities

Abstract

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic…
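For readers unfamiliar with the Rotary Position Embeddings mentioned in the abstract, the following is a minimal sketch of the standard RoPE rotation in plain Python. It is an illustration of the general technique only, not the authors' implementation: the function names `rope_rotate` and `dot`, the interleaved pairing of dimensions, the base of 10000, and the toy vectors are all assumptions made for this example. It shows the property that makes RoPE a relative position scheme: the query-key score depends only on the offset between the two positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to a single query or key vector.

    Hypothetical helper for illustration. Each consecutive pair
    (vec[i], vec[i + 1]) is rotated by the angle position * base ** (-i / d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out

def dot(a, b):
    """Plain dot product, standing in for an unscaled attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (arbitrary values, assumed for the example).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# The two scores agree up to floating-point error.
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is rotated rather than shifted, the vector norm is unchanged and only the relative angle between query and key carries position information.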

Taxonomy

Topics: Topic Modeling · Language and cultural evolution · Domain Adaptation and Few-Shot Learning