
TL;DR
The Spectral-Window Hybrid (SWH) architecture combines global spectral methods and local attention to efficiently model long sequences, matching Transformer performance on short contexts while scaling linearly for extended sequences.
Contribution
SWH introduces a novel parallel architecture that decouples global and local sequence modeling, achieving efficient long-range context handling with linear complexity.
Findings
SWH matches Transformer perplexity on short sequences.
SWH scales linearly to longer sequences.
SWH models long-range dependencies without the quadratic cost of global attention.
Abstract
Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. While Transformers provide precise retrieval via the attention mechanism, their quadratic complexity limits their application to long-horizon tasks. In this work, we propose the Spectral-Window Hybrid (SWH), an architecture that decouples sequence modeling into two parallel streams: a global branch utilizing the Convolution Theorem to model long-range decay dynamics in time, and a local branch employing sliding-window attention for token interactions within a bounded context. By aggregating these representations, SWH avoids the computational bottleneck of global attention while retaining local precision. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while…
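To make the two-stream idea concrete, here is a minimal NumPy sketch of what such a block could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names (`global_branch`, `local_branch`, `swh_block`), the per-channel exponential kernel `decay ** t`, the absence of learned projections, and aggregation by simple addition are all hypothetical choices made for this example. The global branch applies the Convolution Theorem (multiplication in the frequency domain equals convolution in time), which costs O(T log T) for the FFT; the local branch gathers only the last `window` positions per token, so its cost grows linearly with sequence length T.

```python
import numpy as np

def global_branch(x, decay):
    """Causal convolution with a decaying kernel, computed via FFT.

    x: (T, D) sequence; decay: (D,) per-channel rates in (0, 1).
    The kernel k[t] = decay**t is an assumption made for this sketch.
    """
    T, _ = x.shape
    t = np.arange(T)[:, None]                  # (T, 1) time index
    kernel = decay[None, :] ** t               # (T, D)
    n = 2 * T                                  # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n, axis=0)
    K = np.fft.rfft(kernel, n=n, axis=0)
    y = np.fft.irfft(X * K, n=n, axis=0)       # product in frequency = convolution in time
    return y[:T]                               # keep the causal part

def local_branch(x, window):
    """Causal sliding-window self-attention (single head, no learned projections).

    Each position attends to itself and the previous window - 1 positions.
    """
    T, D = x.shape
    offsets = np.arange(window)[None, :]       # (1, W)
    idx = np.arange(T)[:, None] - offsets      # (T, W) key positions i, i-1, ...
    valid = idx >= 0                           # positions before the sequence start are masked
    keys = x[np.clip(idx, 0, None)]            # (T, W, D)
    scores = np.einsum("td,twd->tw", x, keys) / np.sqrt(D)
    scores = np.where(valid, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("tw,twd->td", weights, keys)

def swh_block(x, decay, window):
    """Run both streams in parallel and aggregate by addition (an assumption)."""
    return global_branch(x, decay) + local_branch(x, window)
```

For example, calling `swh_block(x, decay, window=4)` on an input of shape `(T, D)` returns an array of the same shape. The FFT path can be checked against the equivalent recurrence `h[t] = decay * h[t-1] + x[t]`, which it reproduces up to floating-point error.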
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Algorithms and Data Compression · Parallel Computing and Optimization Techniques
