Bayesian Optimality of In-Context Learning with Selective State Spaces
Di Zhang, Jiaqi Xing

TL;DR
This paper proposes Bayesian optimal sequential prediction as a framework for understanding in-context learning. It shows that meta-trained selective state space models asymptotically achieve Bayes-optimality on linear Gaussian state space tasks, and that they attain lower asymptotic risk than gradient-descent-style (ERM) predictors on tasks with temporally correlated noise.
Contribution
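The paper formalizes in-context learning as meta-learning over latent sequence tasks and proves that, for tasks governed by Linear Gaussian State Space Models (LG-SSMs), a meta-trained selective state space model asymptotically attains the Bayes-optimal prediction, i.e. the posterior predictive mean. It also establishes a statistical separation from empirical risk minimization on tasks with temporally correlated noise.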
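For orientation, the setting can be written in standard textbook notation. The symbols below ($A$, $C$, $Q$, $R$, $\theta$) are assumptions chosen for illustration and are not taken from the paper.

```latex
% LG-SSM task with latent parameters \theta = (A, C, Q, R), drawn from a task prior p(\theta)
\begin{aligned}
x_{t+1} &= A x_t + w_t, & w_t &\sim \mathcal{N}(0, Q), \\
y_t     &= C x_t + v_t, & v_t &\sim \mathcal{N}(0, R).
\end{aligned}

% Under squared loss, the Bayes-optimal in-context predictor is the posterior predictive mean
\hat{y}_{t+1}
  = \mathbb{E}\left[ y_{t+1} \mid y_{1:t} \right]
  = \int \mathbb{E}\left[ y_{t+1} \mid y_{1:t}, \theta \right] \, p(\theta \mid y_{1:t}) \, d\theta .
```

When $\theta$ is known, the inner expectation is the one-step prediction of the Kalman filter. The outer integral averages that prediction over the posterior on tasks given the context $y_{1:t}$.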
Findings
Selective SSMs converge faster to Bayes-optimal risk.
Selective SSMs show superior sample efficiency in structured-noise tasks.
Transformers more robustly track latent states than linear models.
Abstract
We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and…
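As a point of reference for what "Bayes-optimal predictor" means on an LG-SSM task, the following is a minimal sketch of the posterior predictive mean when the task parameters are known: the classical Kalman filter. It is not the paper's selective SSM or its experimental setup. The function names (`simulate_lgssm`, `kalman_one_step_predictions`), the scalar parameter values, and the "repeat the last observation" baseline are all assumptions made for illustration.

```python
import numpy as np


def simulate_lgssm(T, A, C, Q, R, m0, P0, rng):
    """Sample y_1..y_T from x_{t+1} = A x_t + w_t, y_t = C x_t + v_t."""
    x = rng.multivariate_normal(m0, P0)  # x_1 drawn from the prior
    ys = []
    for _ in range(T):
        v = rng.multivariate_normal(np.zeros(R.shape[0]), R)
        ys.append(C @ x + v)
        w = rng.multivariate_normal(np.zeros(Q.shape[0]), Q)
        x = A @ x + w
    return np.array(ys)


def kalman_one_step_predictions(ys, A, C, Q, R, m0, P0):
    """Return E[y_t | y_{1:t-1}] for every t, for known (A, C, Q, R)."""
    m, P = m0.copy(), P0.copy()  # predictive mean/covariance of x_t
    preds = []
    for y in ys:
        preds.append(C @ m)              # posterior predictive mean of y_t
        S = C @ P @ C.T + R              # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)   # Kalman gain
        m = m + K @ (y - C @ m)          # measurement update
        P = P - K @ C @ P
        m = A @ m                        # time update to x_{t+1}
        P = A @ P @ A.T + Q
    return np.array(preds)


# Hypothetical scalar task; these values are illustrative only.
A = np.array([[0.9]])
C = np.array([[1.0]])
Q = np.array([[0.1]])
R = np.array([[1.0]])
m0 = np.zeros(1)
P0 = Q / (1.0 - 0.9 ** 2)  # stationary variance of the latent state

rng = np.random.default_rng(0)
ys = simulate_lgssm(2000, A, C, Q, R, m0, P0, rng)

preds = kalman_one_step_predictions(ys, A, C, Q, R, m0, P0)

# Compare on t >= 2 so both predictors have at least one past observation.
mse_kalman = float(np.mean((ys[1:] - preds[1:]) ** 2))
mse_naive = float(np.mean((ys[1:] - ys[:-1]) ** 2))  # predict y_t with y_{t-1}
print(f"Kalman MSE: {mse_kalman:.3f}  naive MSE: {mse_naive:.3f}")
```

With known parameters, the Kalman prediction is the minimum mean-squared-error predictor, so its empirical error should fall below that of the naive baseline. In the in-context setting described in the abstract, the parameters are latent and must be inferred from the context, which this sketch does not attempt.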
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
