Gating Enables Curvature: A Geometric Expressivity Gap in Attention
Satwik Bathula, Anand A. Joshi

TL;DR
This paper studies how multiplicative gating in attention increases geometric expressivity: gated attention can realize curved representation geometries that ungated attention cannot, and gated models perform better on tasks with nonlinear decision boundaries.
Contribution
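The paper establishes a geometric expressivity gap between gated and ungated attention: ungated attention is confined to intrinsically flat statistical manifolds, while gating permits non-flat geometries in representation space, including positively curved ones.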
Findings
Gated attention models exhibit higher representation curvature.
Gated models perform better on nonlinear decision boundary tasks.
Curvature accumulates with depth, amplifying expressivity.
Abstract
Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite this success, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher-Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher…
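The flat-versus-curved distinction in the abstract can be illustrated with a small numerical sketch. This is a toy setup and not the paper's construction: it assumes a Gaussian with fixed covariance sigma^2 I whose mean is mu(theta), in which case the Fisher information metric is g(theta) = J(theta)^T J(theta) / sigma^2, with J the Jacobian of mu. The matrices `A`, `G`, the offset `b`, and the sigmoid gate are hypothetical stand-ins. For an affine mean map the metric is the same at every theta, which implies flatness; with a multiplicative gate the metric varies with theta, which is necessary (though not by itself sufficient) for nonzero curvature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ungated_mean(theta, A, b):
    # Affine mean map: a stand-in for an ungated operator (assumption).
    return A @ theta + b

def gated_mean(theta, A, b, G):
    # Same affine map, multiplied elementwise by a sigmoid gate (assumption).
    return sigmoid(G @ theta) * (A @ theta + b)

def jacobian(f, theta, eps=1e-6):
    # Central finite-difference Jacobian of f at theta, shape (out_dim, in_dim).
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((f(theta + step) - f(theta - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def fisher_metric(f, theta, sigma=1.0):
    # Fisher metric of N(f(theta), sigma^2 I) in the coordinates theta.
    J = jacobian(f, theta)
    return J.T @ J / sigma**2

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
b = rng.normal(size=3)
G = rng.normal(size=(3, 2))

theta_a = np.array([0.0, 0.0])
theta_b = np.array([1.0, -0.5])

def ungated(t):
    return ungated_mean(t, A, b)

def gated(t):
    return gated_mean(t, A, b, G)

# Largest entrywise change in the metric between the two parameter points.
ungated_gap = np.abs(fisher_metric(ungated, theta_a) - fisher_metric(ungated, theta_b)).max()
gated_gap = np.abs(fisher_metric(gated, theta_a) - fisher_metric(gated, theta_b)).max()

print("ungated metric change:", ungated_gap)
print("gated metric change:  ", gated_gap)
```

The ungated change should be zero up to finite-difference rounding, while the gated change should be clearly nonzero. Deciding whether the gated geometry is actually curved would require computing a curvature quantity from the metric, which this sketch does not attempt.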
