The Routing and Filtering Structure of Attention
Shafayeth Jamil, Rehan Kapadia

TL;DR
This paper analyzes the structure of attention in transformers, decomposing it into routing and filtering components, and introduces a diagnostic parameterization to improve interpretability and efficiency.
Contribution
It introduces $S$-$D$ attention to disentangle routing from filtering, revealing spectral cascades and enabling simplified, efficient attention mechanisms.
Findings
Routing operates at low rank, below the allocated capacity.
Linearizing early layers of $S$-$D$ attention increases perplexity by less than 5%.
Cascade architectures reduce attention parameters significantly with minimal perplexity increase.
Abstract
The attention interaction matrix contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability () and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven…
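The split the abstract describes can be illustrated with a short NumPy sketch. This is not the authors' code. It assumes a single head whose interaction matrix is formed as `w_q @ w_k.T` from random stand-in weights, and it uses the entropy-based effective rank of the singular values as one common definition; the paper's exact rank measure is not given in this excerpt. All names (`split_interaction`, `effective_rank`, `w_q`, `w_k`) are hypothetical.

```python
import numpy as np


def split_interaction(m):
    """Split a square matrix into symmetric and skew-symmetric parts."""
    sym = 0.5 * (m + m.T)    # symmetric part ("filtering" in the abstract's terms)
    skew = 0.5 * (m - m.T)   # skew-symmetric part ("routing" in the abstract's terms)
    return sym, skew


def effective_rank(a, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values. Assumed definition, chosen for illustration."""
    s = np.linalg.svd(a, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]           # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))


# Random stand-in weights for one head (assumption: not pretrained weights).
rng = np.random.default_rng(0)
d_model, d_head = 64, 16
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

m = w_q @ w_k.T              # d_model x d_model interaction matrix, rank <= d_head
sym, skew = split_interaction(m)

print("effective rank, symmetric part:     ", effective_rank(sym))
print("effective rank, skew-symmetric part:", effective_rank(skew))
```

Because `m` has rank at most `d_head`, each part has rank at most `2 * d_head`, which gives an upper bound against which a measured effective rank can be compared. With random weights the two parts show no particular structure; the low-rank routing reported in the abstract is a property of trained heads.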
