Sliding Window Recurrences for Sequence Models
Dragos Secrieru, Garyk Brixi, Yoshua Bengio, Taiji Suzuki, Michael Poli, Stefano Massaroli

TL;DR
This paper introduces Sliding Window Recurrences (SWR), a hierarchical decomposition framework for linear recurrences in sequence models, enabling efficient GPU-aligned algorithms that improve speed while maintaining model quality.
Contribution
The authors develop the SWR framework and, from it, Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences, achieving 10-40% speedups over optimized Transformers in 1B-parameter multi-hybrid language models.
Findings
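Phalanx achieves a 10-40% speedup over optimized Transformers in 1B-parameter multi-hybrid models across 4K to 32K context lengths.
Phalanx layers match the perplexity of the optimized Transformer baselines.
Truncating recurrences to hardware-aligned, jagged windows limits costly inter-warp communication.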
Abstract
Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences (SWR). We focus specifically on truncating recurrences to hardware-aligned windows, which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B-parameter multi-hybrid models, Phalanx achieves a 10-40% speedup over optimized Transformers across context lengths from 4K to 32K while matching perplexity.
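To make the idea of a truncated, jagged window concrete, here is a minimal pure-Python sketch. It is not the paper's GPU algorithm and it is not the Phalanx layer. It assumes a scalar recurrence h[t] = a[t] * h[t-1] + u[t], and it assumes one possible reading of "jagged": each position sees its own block plus the whole previous block, so the effective window length varies with the position inside the block. The function names and the `block` parameter are hypothetical and chosen only for illustration.

```python
def linear_recurrence(a, u):
    """Full recurrence h[t] = a[t] * h[t-1] + u[t], starting from h = 0."""
    h, out = 0.0, []
    for a_t, u_t in zip(a, u):
        h = a_t * h + u_t
        out.append(h)
    return out


def jagged_window_recurrence(a, u, block):
    """Illustrative truncation (an assumption, not the paper's kernel).

    Each block runs a local scan from a zero state. Only the final local
    state of the immediately preceding block is carried in, so inputs
    older than one block boundary are dropped.
    """
    out = []
    prev_final = 0.0  # final local state of the previous block
    for start in range(0, len(u), block):
        local = 0.0         # scan restricted to the current block
        carry = prev_final  # previous block's state, decayed step by step
        for t in range(start, min(start + block, len(u))):
            local = a[t] * local + u[t]
            carry = a[t] * carry
            out.append(local + carry)
        prev_final = local  # deliberately excludes the carried-in part
    return out


a = [0.5] * 8
u = [1.0] * 8
full = linear_recurrence(a, u)
windowed = jagged_window_recurrence(a, u, block=2)
```

With `block=2`, the first four outputs of `windowed` agree with `full`, because no input has yet fallen outside the window. From the third block onward the two diverge, since contributions from more than one block back are discarded. In this sketch a block only needs one value from its predecessor, which is the kind of locality the abstract associates with limiting communication between warps.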
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Logic, programming, and type systems
