Flash STU: Fast Spectral Transform Units

Y. Isabel Liu; Windsor Nguyen; Yagiz Devre; Evan Dogariu; Anirudha Majumdar; Elad Hazan

arXiv:2409.10489 · cs.LG · January 21, 2026

TL;DR

Flash STU is a hybrid architecture that interleaves spectral state space layers with sliding window attention. It scales to billions of parameters and, under a fixed parameter budget, outperforms the Transformer, S4, and Mamba-2 on linear dynamical systems, robotics control, and language modeling tasks.

Contribution

The paper presents Flash STU, a hybrid model that interleaves spectral state space layers with sliding window attention for scalable, efficient sequence modeling.
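To make the sliding window attention component concrete, here is a minimal sketch of the attention mask such a layer might use. It assumes a causal window that includes the current token; the function name `sliding_window_mask` and the window convention are illustrative assumptions, not taken from the paper or its repository.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a causal sliding-window mask (illustrative, not the paper's code).

    mask[i][j] is True when query position i may attend to key position j.
    Assumption: each query sees itself plus the previous `window - 1` positions.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# .xx..
# ..xx.
# ...xx
```

Each row has at most `window` allowed positions, so the number of attended pairs grows with sequence length times window size rather than with the square of the sequence length.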

Findings

01. Outperforms the Transformer, S4, and Mamba-2 under a fixed parameter budget.

02. Scales to billions of parameters while keeping near-linear time complexity.

03. Evaluated on linear dynamical systems, robotics control, and language modeling.
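As a rough illustration of why a windowed attention layer helps keep cost close to linear in sequence length, the sketch below counts the (query, key) pairs scored by full causal attention versus a causal sliding window. The helper `attended_pairs` and the window size of 256 are hypothetical choices for illustration; the count ignores the spectral layers entirely, so it is not a model of the full architecture's runtime.

```python
from typing import Optional


def attended_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count (query, key) pairs scored by causal attention.

    window=None means full causal attention. Otherwise each query sees
    itself plus the previous `window - 1` positions (an assumed convention).
    """
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))


for n in (1024, 2048, 4096):
    full = attended_pairs(n)
    windowed = attended_pairs(n, window=256)
    print(f"n={n}: full={full}, windowed={windowed}")
```

Doubling the sequence length roughly quadruples the full-attention count but only about doubles the windowed count once the sequence is much longer than the window.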

Abstract

Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
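The interleaving described in the abstract can be pictured as a layer schedule. The sketch below is only an assumption about what such a schedule could look like: the function `hybrid_layer_schedule`, the layer labels, and the default one-in-two alternation are hypothetical, since this page does not state the actual ratio of spectral to attention layers.

```python
def hybrid_layer_schedule(num_layers: int, attention_every: int = 2) -> list[str]:
    """Return layer-type labels for a hypothetical hybrid stack.

    Every `attention_every`-th layer is sliding window attention; the rest
    are spectral state space layers. The ratio is an assumed default.
    """
    if num_layers < 0 or attention_every < 1:
        raise ValueError("num_layers must be >= 0 and attention_every must be >= 1")
    return [
        "sliding_window_attention" if (i + 1) % attention_every == 0 else "spectral_ssm"
        for i in range(num_layers)
    ]


schedule = hybrid_layer_schedule(6)
for depth, kind in enumerate(schedule):
    print(depth, kind)
```

Exposing the ratio as a parameter keeps the sketch neutral about the real configuration, which should be read from the official repository.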

Code & Models

Repositories

windsornguyen/flash-stu · PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer