Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

TL;DR
This paper advances language modeling with state space models (SSMs) through H3, a new SSM layer that improves expressivity, and FlashConv, a training algorithm that improves efficiency. The resulting models outperform Transformers on several benchmarks.
Contribution
Introduces H3, a new SSM layer designed for language tasks, and FlashConv, a fast FFT-based algorithm, to close the gap between SSMs and attention-based models.
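To make the "FFT-based" part concrete, here is a minimal sketch of the textbook idea such algorithms build on: a linear SSM can be evaluated as a causal convolution, and that convolution can be computed with FFTs in O(N log N) rather than O(N^2). This is not FlashConv itself. The scalar SSM, its parameters `a`, `b`, `c`, and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ssm_recurrence(u, a, b, c):
    """Hypothetical scalar SSM, run step by step:
    x_t = a * x_{t-1} + b * u_t,  y_t = c * x_t,  with x_{-1} = 0."""
    x = 0.0
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = a * x + b * u_t
        y[t] = c * x
    return y

def ssm_kernel(a, b, c, n):
    """Convolution kernel of the same SSM: k[j] = c * a**j * b."""
    return c * b * a ** np.arange(n)

def causal_conv_direct(u, k):
    """O(N^2) reference: y[i] = sum over j <= i of k[j] * u[i - j]."""
    n = len(u)
    y = np.zeros(n)
    for i in range(n):
        for j in range(i + 1):
            y[i] += k[j] * u[i - j]
    return y

def causal_conv_fft(u, k):
    """O(N log N) version: zero-pad to 2N so the circular convolution
    computed by the FFT matches the linear (causal) one on the first N outputs."""
    n = len(u)
    fft_len = 2 * n
    u_f = np.fft.rfft(u, n=fft_len)
    k_f = np.fft.rfft(k, n=fft_len)
    y = np.fft.irfft(u_f * k_f, n=fft_len)
    return y[:n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.standard_normal(64)
    k = ssm_kernel(a=0.9, b=0.5, c=2.0, n=len(u))
    # All three routes should agree up to floating-point error.
    print(np.allclose(causal_conv_fft(u, k), causal_conv_direct(u, k)))
    print(np.allclose(causal_conv_fft(u, k), ssm_recurrence(u, 0.9, 0.5, 2.0)))
```

The recurrence and the convolution are two views of the same computation. The FFT route trades the sequential loop for a few transforms over the whole sequence, which is where the near-linear scaling in sequence length comes from.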
Findings
H3 matches attention on synthetic tasks and approaches Transformer perplexity on OpenWebText.
Hybrid H3-attention models outperform Transformers in perplexity on OpenWebText.
FlashConv accelerates training and inference, enabling larger models with better performance.
Abstract
State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on…
Taxonomy
Topics: Topic Modeling
