Is Mamba Capable of In-Context Learning?
Riccardo Grazzi, Julien Siems, Simon Schrodi, Thomas Brox, Frank Hutter

TL;DR
This paper demonstrates that Mamba, a state space model, exhibits in-context learning capabilities comparable to transformers, especially for long input sequences, offering an efficient alternative for such tasks.
Contribution
The work provides empirical evidence that Mamba can perform in-context learning similarly to transformers, extending ICL capabilities to a more scalable model.
Findings
Mamba closely matches transformer performance on ICL tasks.
Mamba effectively handles long input sequences.
ICL in Mamba involves incremental internal optimization.
Abstract
State of the art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL), a variant of meta-learning concerning the learned ability to solve tasks during a neural network forward pass, exploiting contextual information provided as input to the model. This useful ability emerges as a side product of the foundation model's massive pretraining. While transformer models are currently the state of the art in ICL, this work provides empirical evidence that Mamba, a newly proposed state space model which scales better than transformers w.r.t. the input sequence length, has similar ICL capabilities. We evaluated Mamba on tasks involving simple function approximation as well as more complex natural language processing problems. Our results demonstrate that, across both categories of tasks, Mamba closely matches the performance of transformer models for ICL. Further…
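To make the "simple function approximation" setting concrete, here is a minimal sketch of one in-context regression task. It is an illustration only, not the authors' benchmark or code: the linear function class, the dimensions, and the helper names `sample_linear_task` and `gd_baseline` are all assumptions. A model under evaluation would see the (x, y) pairs as context and be asked to predict the label of the query; the explicit gradient-descent loop is just a hand-written reference predictor to compare against.

```python
import random


def sample_linear_task(dim, n_examples, rng):
    """Sample one toy task: hidden weights, in-context examples, and a query.

    Hypothetical helper for illustration; not taken from the paper.
    """
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # hidden task weights
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
          for _ in range(n_examples + 1)]
    ys = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    context = list(zip(xs[:-1], ys[:-1]))  # (x_i, y_i) pairs given as context
    return context, xs[-1], ys[-1]         # last point is the held-out query


def gd_baseline(context, query, steps=500, lr=0.1):
    """Reference predictor: plain gradient descent on the in-context examples."""
    dim = len(query)
    w_hat = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in context:
            err = sum(wi * xi for wi, xi in zip(w_hat, x)) - y
            for j in range(dim):
                grad[j] += err * x[j] / len(context)
        w_hat = [wj - lr * gj for wj, gj in zip(w_hat, grad)]
    return sum(wi * xi for wi, xi in zip(w_hat, query))


rng = random.Random(0)
context, query, target = sample_linear_task(dim=3, n_examples=20, rng=rng)
prediction = gd_baseline(context, query)
print(f"target={target:.3f} prediction={prediction:.3f}")
```

Because the toy labels are noise-free and there are more examples than dimensions, the baseline should recover the query label closely; an ICL-capable model would be judged by how near its forward-pass prediction comes to the same target.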
Taxonomy
Topics: Education and Technology Integration
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing
