Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
Athanasios Zeris

TL;DR
This paper introduces Energy-Gated Attention (EGA), a simple modification to transformer attention, based on spectral energy, that improves validation loss by emphasizing tokens with higher informational content. It is validated on multiple datasets, including TinyShakespeare and Penn Treebank.
Contribution
The paper proposes EGA, a novel spectral energy gating mechanism for transformer attention, demonstrates its effectiveness and dataset independence, and explores optimal wavelet bases for spectral analysis.
Findings
EGA improves validation loss on TinyShakespeare by 0.103 with minimal overhead.
EGA achieves a similar improvement on Penn Treebank (0.101).
Learned spectral energy thresholds align with linguistic properties of English text.
Abstract
Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures -- the energetically dominant, spatially organized patterns that persist amid background chaos -- carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned…
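The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it names the learned operator that computes spectral energy, so everything specific here is an assumption made for illustration: the energy is taken to be the squared norm of a hypothetical learned linear projection (W_energy) of each key embedding, the gate is a sigmoid of that energy minus a learned threshold, and the gated attention weights are not renormalized. The function names energy_gates and energy_gated_attention are likewise hypothetical and are not the paper's.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def energy_gates(keys, W_energy, threshold, sharpness=1.0):
    """Return one gate in (0, 1) per key token.

    ASSUMPTION: energy is the squared norm of a learned linear projection
    of the key embedding. The source does not specify the operator.
    """
    gates = []
    for k in keys:
        projected = [sum(w * x for w, x in zip(row, k)) for row in W_energy]
        energy = sum(p * p for p in projected)
        # Soft threshold: tokens with energy above `threshold` get gates near 1.
        gates.append(1.0 / (1.0 + math.exp(-sharpness * (energy - threshold))))
    return gates


def energy_gated_attention(queries, keys, values, gates):
    """Scaled dot-product attention with per-key gates applied to the
    value aggregation.

    ASSUMPTION: gated weights are used as-is (no renormalization).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        gated = [w * g for w, g in zip(weights, gates)]
        outputs.append(
            [sum(gw * v[j] for gw, v in zip(gated, values))
             for j in range(len(values[0]))]
        )
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[1.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights
    gates = energy_gates(keys, identity, threshold=2.0)
    print(gates)
    print(energy_gated_attention(queries, keys, values, gates))
```

With every gate set to 1.0 the function reduces to standard scaled dot-product attention, which makes the gate easy to ablate when comparing against a baseline.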
