Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning
Nicholas Barnfield, Subhabrata Sen, Pragya Sur

TL;DR
This paper establishes that multi-layer cross-attention mechanisms are provably optimal for multi-modal in-context learning, demonstrating the importance of depth and cross-attention in achieving Bayes-optimal performance.
Contribution
The paper introduces a mathematically tractable framework for multi-modal in-context learning and proves the optimality of multi-layer cross-attention mechanisms under this model.
Findings
Single-layer linear self-attention cannot recover the Bayes-optimal predictor uniformly over the task distribution.
Linearized multi-layer cross-attention is provably Bayes-optimal when optimized with gradient flow.
Depth enhances the in-context learning capability of attention-based models.
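To make the terminology in the findings concrete, here is a minimal NumPy sketch of what a stack of linearized (softmax-free) cross-attention layers could look like on two modalities driven by a shared latent factor. It is an illustration under assumptions, not the authors' construction: the data generator, the residual stacking, the 1/n scaling, and every name (`sample_latent_factor_data`, `linear_cross_attention`, `stacked_cross_attention`, `W_q`, `W_k`, `W_v`) are hypothetical choices made for this example.

```python
import numpy as np


def sample_latent_factor_data(n, k, d_a, d_b, rng, noise=0.1):
    """Draw two modalities that share a latent factor Z (assumed toy model)."""
    Z = rng.standard_normal((n, k))      # shared latent factors
    A = rng.standard_normal((k, d_a))    # loading matrix for modality A
    B = rng.standard_normal((k, d_b))    # loading matrix for modality B
    X_a = Z @ A + noise * rng.standard_normal((n, d_a))
    X_b = Z @ B + noise * rng.standard_normal((n, d_b))
    return X_a, X_b


def linear_cross_attention(X_query, X_context, W_q, W_k, W_v):
    """One softmax-free cross-attention layer.

    Queries come from one modality; keys and values come from the other.
    """
    Q = X_query @ W_q                    # (n_q, d_attn)
    K = X_context @ W_k                  # (n_c, d_attn)
    V = X_context @ W_v                  # (n_c, d_out)
    n_c = X_context.shape[0]
    scores = Q @ K.T / n_c               # raw inner products, no softmax
    return scores @ V                    # (n_q, d_out)


def stacked_cross_attention(X_a, X_b, layers):
    """Apply several cross-attention layers with residual connections."""
    H = X_a
    for W_q, W_k, W_v in layers:
        H = H + linear_cross_attention(H, X_b, W_q, W_k, W_v)
    return H


rng = np.random.default_rng(0)
X_a, X_b = sample_latent_factor_data(n=32, k=3, d_a=5, d_b=7, rng=rng)

d_attn = 4
layers = [
    (
        0.1 * rng.standard_normal((5, d_attn)),  # W_q: modality A -> attention dim
        0.1 * rng.standard_normal((7, d_attn)),  # W_k: modality B -> attention dim
        0.1 * rng.standard_normal((7, 5)),       # W_v: modality B -> modality A dim
    )
    for _ in range(3)
]
H = stacked_cross_attention(X_a, X_b, layers)
print(H.shape)
```

Dropping the softmax makes each layer bilinear in the query and context inputs, which is the sense in which the word "linearized" is used in this sketch; the number of tuples in `layers` plays the role of depth.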
Abstract
Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in…
