MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas

TL;DR
MesaNet introduces a test-time training approach with a locally optimal layer that improves language modeling performance, especially on long-context tasks, by solving sequential optimization problems during inference.
Contribution
It presents a numerically stable, chunkwise parallelizable version of the Mesa layer that minimizes an in-context loss at each time step using conjugate gradient, enhancing language modeling.
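As a rough illustration of the conjugate gradient (CG) solve mentioned above, the following NumPy sketch runs a fixed number of CG steps on a small symmetric positive-definite system of the form $(H + \Lambda) x = q$. The function name, the toy dimensions, and the random data are assumptions made for illustration; this is not the paper's chunkwise parallel implementation.

```python
import numpy as np

def conjugate_gradient(A, b, num_steps):
    """Run at most `num_steps` CG iterations on A x = b.

    Assumes A is symmetric positive definite. Starts from x = 0.
    """
    x = np.zeros_like(b)
    r = b.copy()              # residual b - A x for x = 0
    p = r.copy()              # first search direction
    rs = r @ r                # squared residual norm
    floor = 1e-24 * (b @ b)   # stop once the residual is negligible
    for _ in range(num_steps):
        if rs <= floor:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)         # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p     # next A-conjugate direction
        rs = rs_new
    return x

# Toy problem (hypothetical data): H is a Gram matrix of random keys,
# Lambda a positive diagonal regularizer.
rng = np.random.default_rng(0)
keys = rng.normal(size=(32, 6))
H = keys.T @ keys
Lam = np.diag(np.full(6, 1.0))
q = rng.normal(size=6)

x_cg = conjugate_gradient(H + Lam, q, num_steps=12)
x_ref = np.linalg.solve(H + Lam, q)
print("max abs difference to direct solve:", np.max(np.abs(x_cg - x_ref)))
```

Each CG step needs only one matrix-vector product, which is why the number of steps acts as a compute knob at inference time.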
Findings
Lower perplexity in language modeling tasks.
Higher downstream benchmark performance.
Effective on tasks requiring long context understanding.
Abstract
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly with sequence length during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba, or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to…
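To make the contrast in the abstract concrete, the sketch below compares an online learning rule with an exact minimizer on a toy in-context regression problem. It is a simplified illustration under stated assumptions: the loss is an ungated ridge objective $\frac{1}{2}\sum_i \lVert v_i - W k_i \rVert^2 + \frac{\lambda}{2}\lVert W \rVert_F^2$ with a scalar $\lambda$, the online rule is a plain delta-rule step with a fixed step size `beta`, and all function names and data are hypothetical rather than taken from the paper.

```python
import numpy as np

def regularized_loss(W, keys, values, lam):
    """Ridge regression loss over all (key, value) pairs seen in context."""
    residuals = values - keys @ W.T
    return 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.sum(W ** 2)

def delta_rule_weights(keys, values, beta):
    """One online gradient step per token on 0.5 * ||v - W k||^2."""
    W = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        W = W + beta * np.outer(v - W @ k, k)
    return W

def closed_form_weights(keys, values, lam):
    """Exact minimizer W = G (H + lam I)^{-1} of the ridge loss."""
    G = values.T @ keys            # sum_i v_i k_i^T
    H = keys.T @ keys              # sum_i k_i k_i^T
    A = H + lam * np.eye(H.shape[0])
    return np.linalg.solve(A, G.T).T   # A is symmetric, so this equals G A^{-1}

# Hypothetical toy data: unit-norm keys, values from a noisy linear map.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
W_true = rng.normal(size=(4, 8))
values = keys @ W_true.T + 0.1 * rng.normal(size=(64, 4))

lam = 0.1
W_online = delta_rule_weights(keys, values, beta=0.5)
W_exact = closed_form_weights(keys, values, lam)
print("loss, online rule:", regularized_loss(W_online, keys, values, lam))
print("loss, exact solve:", regularized_loss(W_exact, keys, values, lam))
```

Because the closed-form weights minimize this objective, their loss can never exceed that of the online rule; how large the gap is depends on the data and the step size.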
Peer Reviews
Decision: ICLR 2026 Poster
First, I stress that I liked this paper a lot and I'm excited about the latest developments in the field of linear-time Transformer alternatives, and in particular the direction where this and related works are steering the field.
* The proposed method (online learning objective, corresponding update rule, chunk-wise and recurrent implementation algorithms) is novel and original.
* The paper is very rich in content, be it new theory, connections to other literature, and especially experiments
The below comments do not represent major weaknesses and don’t denigrate the quality of the paper. They aim for improvement of presentation or consider the pieces of text which the authors could just omit without detriment of the exposition.
1. Please state in the beginning of the paper that all vectors are column vectors to avoid confusion because many related works such as GLA or Gated DeltaNet use row vectors as a convention.
2. Lines 121-123: How do you derive the expression $\gamma_t \Ph
* Clear objective with a closed-form fast-weight map and practical CG solve; principled dynamic test-time compute via stopping criteria.
* Strong empirical controls (same backbone/tokenizer/data order) enabling clean comparisons.
* Useful diagnostics: early-sequence NLL gains, length extrapolation, grouped tasks; dynamic stopping achieves near-parity to larger fixed $k$ at substantially fewer steps on average.
* Missing a concise accuracy–efficiency summary at inference (for one representative long-sequence setting), including per-token latency and peak memory under fixed batch, hardware, and precision.
* Stability/conditioning analysis is qualitative; small ablations on the softplus scale for $\Lambda$ and the diagonal preconditioner or $x_0$ initializer are needed to establish sensitivity and recommend defaults.
1. The proposed method is well-motivated. It presents a numerically stable Mesa layer that solves $q^*_t=(H_t + \Lambda)^{-1} q_t$ per timestep via a CG solver with gated state recurrences, yielding a well-posed recurrent formulation.
2. Results are competitive with strong RNN-style baselines and broadly comparable to a Transformer at similar scale, and the accompanying analysis is generally sound.
1. The paper shows per-layer timings and training throughput (Fig. 2), but there’s no single, end-to-end table that reports latency (ms/token), tokens/s, and GPU memory alongside quality across CG step counts or the stopping policy, and across multiple context lengths on the same hardware.
2. It’s unclear when to prefer Mesa over MHA. Although the paper acknowledges that compute grows with CG steps and may exceed MHA past some step/key sizes, it lacks concrete recipes or tables mapping k/toleran
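The per-timestep solve $q^*_t=(H_t + \Lambda)^{-1} q_t$ and the stopping criteria discussed in the reviews can be sketched as a simple recurrent loop. This is a minimal single-head sketch under assumptions not confirmed by the text above: the states are taken to follow $G_t = \gamma_t G_{t-1} + \beta_t v_t k_t^\top$ and $H_t = \gamma_t H_{t-1} + \beta_t k_t k_t^\top$, the output is $G_t q^*_t$, and CG stops once the relative residual falls below a tolerance `tol`. The function names, gate values, and tolerance are hypothetical.

```python
import numpy as np

def cg_with_tolerance(A, b, tol, max_steps):
    """CG on A x = b (A symmetric positive definite) with early stopping.

    Stops when ||residual|| <= tol * ||b|| or after max_steps iterations.
    Returns the approximate solution and the number of steps used.
    """
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.sqrt(b @ b)
    steps = 0
    while steps < max_steps and np.sqrt(rs) > tol * b_norm:
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
        steps += 1
    return x, steps

def mesa_readout(keys, values, queries, gammas, betas, lam, tol=1e-4, max_steps=30):
    """Recurrent sketch: update gated states, then solve one system per token."""
    T, d_k = keys.shape
    d_v = values.shape[1]
    G = np.zeros((d_v, d_k))   # gated sum of v k^T (assumed form)
    H = np.zeros((d_k, d_k))   # gated sum of k k^T (assumed form)
    Lam = np.diag(lam)         # positive diagonal regularizer
    outputs, step_counts = [], []
    for t in range(T):
        G = gammas[t] * G + betas[t] * np.outer(values[t], keys[t])
        H = gammas[t] * H + betas[t] * np.outer(keys[t], keys[t])
        q_star, steps = cg_with_tolerance(H + Lam, queries[t], tol, max_steps)
        outputs.append(G @ q_star)
        step_counts.append(steps)
    return np.stack(outputs), np.array(step_counts)

# Hypothetical toy inputs.
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 4, 3
keys = rng.normal(size=(T, d_k))
values = rng.normal(size=(T, d_v))
queries = rng.normal(size=(T, d_k))
gammas = np.full(T, 0.9)
betas = np.full(T, 0.5)
lam = np.ones(d_k)

outputs, step_counts = mesa_readout(keys, values, queries, gammas, betas, lam)
print("CG steps per token:", step_counts)
```

The state size is fixed regardless of sequence length, while the number of CG steps per token varies with the conditioning of $H_t + \Lambda$; a looser `tol` trades accuracy for fewer steps.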
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
