Longhorn: State Space Models are Amortized Online Learners
Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu

TL;DR
This paper introduces Longhorn, a state-space model (SSM) architecture derived from online learning principles. It is more sample-efficient than existing SSMs such as Mamba and extrapolates to contexts longer than those seen in training.
Contribution
The paper frames SSM design as solving an online learning objective and derives a new architecture, Longhorn, whose state update follows from that objective, improving sample efficiency and long-context extrapolation.
Findings
Longhorn outperforms state-of-the-art SSMs, including Mamba, on the paper's benchmarks.
It achieves 1.8x better sample efficiency than Mamba.
It extrapolates to contexts up to 16x longer at inference.
Abstract
Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form…
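The step from "online learning objective" to "state transition rule" can be made concrete with a short derivation. The block below is a sketch under stated assumptions, not a transcription of the paper's equations: it assumes a per-row objective that trades off staying close to the previous state against a weighted squared recall error, and the symbols \(s_t\), \(k_t\), \(x\), \(\beta\), and \(\varepsilon\) are chosen here for illustration.

```latex
% Assumed per-row objective: s_{t-1} is one row of the previous state,
% k_t the key, x the scalar target for this row, \beta > 0 its weight.
\begin{aligned}
s_t &= \arg\min_{s}\; \|s - s_{t-1}\|^2 + \beta\,(s^\top k_t - x)^2 \\
0 &= (s_t - s_{t-1}) + \beta\,(s_t^\top k_t - x)\,k_t
\;\Longrightarrow\;
(I + \beta\,k_t k_t^\top)\,s_t = s_{t-1} + \beta\,x\,k_t \\
(I + \beta\,k_t k_t^\top)^{-1} &= I - \varepsilon\,k_t k_t^\top,
\qquad \varepsilon = \frac{\beta}{1 + \beta\,k_t^\top k_t}
\quad \text{(Sherman--Morrison)} \\
s_t &= (I - \varepsilon\,k_t k_t^\top)\,s_{t-1} + \varepsilon\,x\,k_t
\end{aligned}
```

Under these assumptions the transition matrix \(I - \varepsilon\,k_t k_t^\top\) is an identity plus a rank-one term, and the step size \(\varepsilon\) is always below \(1 / (k_t^\top k_t)\), so the update needs no separately tuned forget gate.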
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper is well-written and easy to follow.
2. The formulation via online learning for solving in-context associative recall is interesting and elegant. It explains why Longhorn (and also DeltaNet) performs well in MQAR tasks.
3. Empirical results look good.
1. The main issue with this work is that the implementation does not fully align with the theory. Using a diagonal matrix to approximate an “identity-plus-low-rank” dense matrix is coarse, and it’s unclear if the theoretical advantage translates to this setting.
2. In Eq. 5, the norm \(\text{diag}(\beta_t)\) appears unusual and is not well-motivated or empirically validated. Why is a vector-valued \(\epsilon\) necessary? If not, the DeltaNet structure could leverage the kernel from Yang et al. (20
1. The paper is well-written and easy to follow. The clarity of explanation makes complex ideas accessible, particularly in sections like Appendices A and B, which provide valuable insights into the nuances of different approaches.
2. The novelty of the new formulation for the Longhorn approach is impressive. The retrieval-based perspective is both innovative and elegantly presented, offering a fresh solution that enhances the field.
3. The exploration of SSM structure variances through online l
1. While this paper presents a focused study on architecture, the data and model scale seem limited. Expanding the experimental scale and providing a more comprehensive analysis would significantly enhance the paper's impact.
2. The reduction in perplexity compared to Mamba is notable. However, the results in Table 2 appear mixed, which could benefit from further clarification or exploration.
3. Including additional experiments, such as MMLU, GSM-8K, and more extensive long-context benchmarks, w
- The online learning framework provides a fair theoretical underpinning for understanding linear attention models and SSMs. This approach not only supports the conceptual innovations presented but also enhances the interpretability of SSM behaviors in practical applications.
- Empirical results: Longhorn has good sample efficiency compared to SOTA models such as Mamba and GLA. This advantage is critical in scenarios where computational resources are limited.
Approximation: while the diagonal approximation is a key aspect of Longhorn's implementation, its impact on how well the theoretical framework aligns with empirical results remains unclear to me. A deeper exploration of how this approximation influences model performance could bridge the gap between theoretical predictions and observed outcomes.
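The gap the reviewers describe between an "identity-plus-low-rank" transition and its diagonal approximation can be probed on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the objective is a squared distance to the previous state plus a \(\text{diag}(\beta_t)\)-weighted squared recall error, and that the approximation keeps only the diagonal of the dense transition. The function names `exact_step`, `diagonal_step`, and `residual` and the toy values are hypothetical.

```python
# Sketch only: function names and toy values are illustrative assumptions,
# not taken from the paper's code.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def exact_step(S, k, x, beta):
    """Closed-form minimizer of
    ||S' - S||_F^2 + sum_i beta[i] * (S'[i].k - x[i])^2.
    Row i becomes S[i] - eps_i * (S[i].k - x[i]) * k, which equals
    S[i] (I - eps_i k k^T) + eps_i x[i] k^T, eps_i = beta_i / (1 + beta_i k.k).
    """
    kk = dot(k, k)
    out = []
    for row, xi, bi in zip(S, x, beta):
        eps = bi / (1.0 + bi * kk)
        err = dot(row, k) - xi
        out.append([s - eps * err * kj for s, kj in zip(row, k)])
    return out

def diagonal_step(S, k, x, beta):
    """Assumed diagonal approximation: keep only the diagonal of
    (I - eps_i k k^T), so entry (i, j) decays by (1 - eps_i * k_j^2)
    before the write eps_i * x_i * k_j is added."""
    kk = dot(k, k)
    out = []
    for row, xi, bi in zip(S, x, beta):
        eps = bi / (1.0 + bi * kk)
        out.append([(1.0 - eps * kj * kj) * s + eps * xi * kj
                    for s, kj in zip(row, k)])
    return out

def residual(S, k, x):
    """Recall error S k - x, one entry per row of S."""
    return [dot(row, k) - xi for row, xi in zip(S, x)]

# Toy state with d = 2 rows and m = 3 columns.
S0 = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
k = [1.0, 2.0, -1.0]
x = [3.0, -2.0]
beta = [0.5, 2.0]

S_exact = exact_step(S0, k, x, beta)
S_diag = diagonal_step(S0, k, x, beta)

print("recall error before:  ", residual(S0, k, x))
print("after exact step:     ", residual(S_exact, k, x))
print("after diagonal step:  ", residual(S_diag, k, x))
```

In this sketch the exact step shrinks each row's recall error by a factor of \(1 + \beta_i\,k^\top k\), and the two steps coincide when the previous state is zero because they differ only in how old content is erased. Whether that difference matters at scale is the empirical question the reviews raise, and a toy example like this does not answer it.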
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
