Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning

Nicholas Barnfield; Subhabrata Sen; Pragya Sur

arXiv:2602.04872·stat.ML·May 19, 2026

TL;DR

This paper shows that, when multi-modal data arises from a latent factor model, a linearized multi-layer cross-attention mechanism is provably Bayes-optimal for in-context learning, whereas single-layer linear self-attention is not. The results highlight the role of both depth and cross-attention.

Contribution

The paper introduces a mathematically tractable framework for multi-modal in-context learning, based on a latent factor model, and proves that a linearized multi-layer cross-attention mechanism is Bayes-optimal under this model.

Findings

01

Single-layer linear self-attention cannot recover the Bayes-optimal predictor uniformly over the task distribution.

02

Linearized multi-layer cross-attention is provably Bayes-optimal when optimized with gradient flow.

03

Depth enhances the in-context learning capability of attention-based models.

Abstract

Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in…
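To make the setup in the abstract more concrete, here is a minimal sketch of what a two-modality in-context prompt drawn from a latent factor model might look like, together with a single-layer linear-attention-style prediction. Everything specific here is an assumption for illustration and is not taken from the paper: the function names `sample_prompt` and `linear_attention_predict`, the Gaussian loadings `A` and `B`, the linear label `y = <beta, z> + noise`, and the predictor form `f_q^T Gamma (1/n) sum_i y_i f_i`, which is the reduced form commonly used in the unimodal linear-attention literature. The matrix `Gamma` is a fixed placeholder, not a trained weight.

```python
import numpy as np


def sample_prompt(n, d1, d2, k, rng, noise=0.1):
    """Sample one hypothetical two-modality in-context prompt.

    Assumed model (illustrative, not the paper's exact specification):
    a shared latent factor z_i in R^k drives both modalities,
        x_i = A z_i + noise,   w_i = B z_i + noise,
    and the label is y_i = <beta, z_i> + noise.
    A, B and beta are redrawn for every prompt (task).
    """
    A = rng.standard_normal((d1, k)) / np.sqrt(k)   # loadings, modality 1
    B = rng.standard_normal((d2, k)) / np.sqrt(k)   # loadings, modality 2
    beta = rng.standard_normal(k)                   # task vector on the latent
    Z = rng.standard_normal((n + 1, k))             # n context rows + 1 query
    X = Z @ A.T + noise * rng.standard_normal((n + 1, d1))
    W = Z @ B.T + noise * rng.standard_normal((n + 1, d2))
    y = Z @ beta + noise * rng.standard_normal(n + 1)
    return X, W, y


def linear_attention_predict(F_ctx, y_ctx, f_query, Gamma):
    """Single-layer linear-attention-style prediction (assumed reduced form):
    y_hat = f_query^T Gamma (1/n) sum_i y_i f_i.
    """
    n = F_ctx.shape[0]
    moment = F_ctx.T @ y_ctx / n        # (1/n) sum_i y_i f_i
    return float(f_query @ Gamma @ moment)


rng = np.random.default_rng(0)
X, W, y = sample_prompt(n=64, d1=5, d2=4, k=3, rng=rng)

# Concatenate the two modalities into one feature vector per token.
F = np.concatenate([X, W], axis=1)
Gamma = np.eye(F.shape[1])              # placeholder weights, not trained

# Predict the last row's label from the first n rows used as context.
y_hat = linear_attention_predict(F[:-1], y[:-1], F[-1], Gamma)
```

The sketch only fixes the data shapes and the form of a one-layer predictor. It does not reproduce the paper's negative result or its cross-attention construction, both of which depend on details elided from the truncated abstract above.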
