TL;DR
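Flash STU is a hybrid architecture that interleaves spectral state space layers with sliding window attention. It scales to billions of parameters with near-linear time complexity and, at a fixed parameter budget, outperforms the Transformer, S4, and Mamba-2 on linear dynamical systems, robotics control, and language modeling.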
Contribution
The paper presents the Flash STU, a hybrid model that interleaves spectral state space layers with sliding window attention for scalable and efficient sequence modeling.
Findings
Outperforms Transformers, S4, and Mamba-2 given the same parameter budget.
Scales to billions of parameters with near-linear time complexity.
Evaluated on diverse sequence prediction tasks: linear dynamical systems, robotics control, and language modeling.
Abstract
Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
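The abstract names sliding window attention as the attention half of the hybrid. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: each position attends only to itself and the preceding `window - 1` positions, so the cost grows as seq_len × window rather than seq_len². The function names, the single-head formulation, and the strict one-to-one alternation in `interleaved_schedule` are illustrative assumptions; the summary above does not state the window size or the ratio of spectral layers to attention layers.

```python
import math


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and local (i - j < window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sliding_window_attention(q, k, v, window):
    """Single-head softmax attention restricted to a causal sliding window.

    q, k, v are lists of seq_len vectors (lists of floats).
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)  # first key position visible to query i
        keys = range(lo, i + 1)
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keys
        ]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, keys)) / total
            for c in range(len(v[0]))
        ])
    return out


def interleaved_schedule(num_layers):
    """Hypothetical layer order: strict alternation is an assumption."""
    return [
        "spectral_ssm" if i % 2 == 0 else "sliding_window_attention"
        for i in range(num_layers)
    ]
```

With `window=1` every position sees only itself, so the output equals `v`; widening the window trades more compute per token for longer-range context within the attention layers.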
Taxonomy
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer
