Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in Modern Transformers
Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

TL;DR
This paper investigates how multimodal transformers learn to associate information across modalities through in-context learning, revealing asymmetries and circuit dynamics that underpin this ability.
Contribution
It introduces a controlled experimental framework to analyze multimodal ICL, uncovering asymmetries and circuit mechanisms that extend unimodal principles to multimodal settings.
Findings
Rotary Position Embeddings increase the data complexity threshold for ICL
Low data complexity in secondary modality suffices for multimodal ICL
Multimodal training refines circuits that copy labels across modalities
Abstract
Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic…
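For readers unfamiliar with the Rotary Position Embeddings mentioned in the abstract, the following is a minimal sketch of the commonly used formulation, not the authors' implementation. The function name `rope`, the frequency base of 10000, the pairing of adjacent dimensions, and the example vectors are all assumptions made for illustration. The sketch rotates each pair of dimensions by an angle proportional to the token position, so that the dot product between a rotated query and a rotated key depends only on their relative offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate adjacent pairs of `vec` by position-dependent angles.

    Illustrative only: `base` and the adjacent-pair layout follow the
    common RoPE formulation and are assumptions, not details taken
    from the paper summarized here.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        # i equals 2 * (pair index), so the frequency is base^(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both position pairs have the same offset (4), so the attention
# scores should agree up to floating-point error.
score_a = dot(rope(q, 7), rope(k, 3))
score_b = dot(rope(q, 104), rope(k, 100))
```

Comparing `score_a` and `score_b` shows the relative-position property: shifting both positions by the same amount leaves the score unchanged, and each rotation preserves the vector's norm.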
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Domain Adaptation and Few-Shot Learning
