Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
Keston Aquino-Michaels

TL;DR
Learned sparse attention in transformers often performs little better than random gating. The paper attributes this to routing absorption: the model's Q/K/V projections co-adapt to whatever mask is imposed, which leaves learned routing with little to add.
Contribution
The paper introduces routing absorption in sparse attention, supports it with four independent lines of evidence from a controlled 31M-parameter transformer, and connects it to similar issues in Mixture-of-Experts, highlighting the limits of end-to-end learned gating.
Findings
Differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds).
Hard top-k gating receives exactly zero gradient through the mask.
Stochastic mask randomization fails to prevent co-adaptation.
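The zero-gradient finding for hard top-k gating can be illustrated with a small sketch. This is not the paper's code: the single-query setup, the function names (`topk_mask`, `masked_attention_output`, `finite_diff_grad`), and all numbers are assumptions chosen for illustration. The sketch uses central finite differences instead of autodiff, and relies on the fact that the output is piecewise constant in the gate scores: a small change in a score does not change which entries are selected, so the output does not move.

```python
import math

def softmax(xs):
    # Numerically stable softmax; -inf entries receive weight 0.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mask(gate_scores, k):
    # Hard selection: 1.0 for the k highest gate scores, 0.0 elsewhere.
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(gate_scores))]

def masked_attention_output(attn_logits, values, gate_scores, k):
    # Attention for one query over scalar values, restricted to the kept entries.
    mask = topk_mask(gate_scores, k)
    masked = [l if m == 1.0 else -math.inf for l, m in zip(attn_logits, mask)]
    weights = softmax(masked)
    return sum(w * v for w, v in zip(weights, values))

def finite_diff_grad(f, xs, eps=1e-4):
    # Central finite-difference estimate of df/dx_i for each coordinate.
    grads = []
    for i in range(len(xs)):
        up, down = list(xs), list(xs)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

# Illustrative (made-up) inputs for a single query over four keys.
attn_logits = [2.0, 0.5, -1.0, 1.0]
values = [1.0, -2.0, 0.5, 3.0]
gate_scores = [0.9, 0.1, 0.4, 0.7]
k = 2

# Gradient with respect to the gate scores: zero for every entry.
gate_grad = finite_diff_grad(
    lambda g: masked_attention_output(attn_logits, values, g, k), gate_scores
)

# Gradient with respect to the attention logits: nonzero for kept entries only.
logit_grad = finite_diff_grad(
    lambda l: masked_attention_output(l, values, gate_scores, k), attn_logits
)

print("d(output)/d(gate_scores):", gate_grad)
print("d(output)/d(attn_logits):", logit_grad)
```

In this toy setting the gate receives no learning signal from the output, while the attention logits of the kept entries do, which is consistent with the finding above.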
Abstract
Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against…
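The abstract's premise that attention distributions are highly concentrated can be made concrete by measuring how much softmax mass the top-k entries capture. The sketch below is only an illustration under stated assumptions: the logits are invented, and `topk_mass` is a hypothetical helper, not a quantity defined in the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mass(weights, k):
    # Fraction of total attention weight held by the k largest entries.
    return sum(sorted(weights, reverse=True)[:k])

# Made-up logits for one query over eight keys, with two dominant entries.
logits = [6.0, 5.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
weights = softmax(logits)
mass = topk_mass(weights, 2)

print(f"top-2 of {len(weights)} entries hold {mass:.1%} of the attention mass")
```

When a distribution is this peaked, a mask that keeps only the top entries discards very little weight, which is why identifying the important entries post-hoc is plausible in principle.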
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
