Geometry of Lightning Self-Attention: Identifiability and Dimension
Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn

TL;DR
This paper explores the geometric properties of self-attention networks, analyzing their function spaces using algebraic geometry to understand identifiability, dimension, and singularities, with extensions to normalized models.
Contribution
It provides a theoretical geometric analysis of self-attention networks, including identifiability, dimension, and boundary points, extending to normalized models.
Findings
Characterized the generic fibers of the parametrization for deep attention networks.
Computed the dimension of the function space for an arbitrary number of layers.
Formulated a conjectural extension to normalized self-attention networks, proved it for a single layer, and numerically verified it in the deep case.
Abstract
We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
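The claims that these networks are polynomial and that their parameters are not uniquely identifiable can be illustrated with a minimal NumPy sketch. It assumes a single softmax-free layer of the form X ↦ V X Xᵀ Kᵀ Q X, with tokens stored as the columns of X; this is one common way to write unnormalized attention, and the paper's exact convention may differ. The function name `lightning_attention` and all dimensions below are illustrative choices, not taken from the paper.

```python
import numpy as np

def lightning_attention(X, Q, K, V):
    """Softmax-free attention layer (assumed form, see lead-in).

    X: (d, t) input, one token per column.
    Q, K: (a, d) query and key weights; V: (d_out, d) value weights.
    """
    scores = (K @ X).T @ (Q @ X)  # (t, t) bilinear scores, no softmax applied
    return V @ X @ scores         # (d_out, t)

rng = np.random.default_rng(0)
d, a, d_out, t = 4, 3, 2, 5       # illustrative sizes
X = rng.standard_normal((d, t))
Q = rng.standard_normal((a, d))
K = rng.standard_normal((a, d))
V = rng.standard_normal((d_out, d))
Y = lightning_attention(X, Q, K, V)

# Polynomial structure: the output is homogeneous of degree 3 in the input.
c = 1.7
is_cubic = np.allclose(lightning_attention(c * X, Q, K, V), c**3 * Y)

# Rescaling symmetry: (Q, V) -> (c Q, V / c) leaves the function unchanged.
rescale_invariant = np.allclose(lightning_attention(X, c * Q, K, V / c), Y)

# Change of basis in the attention space: (K, Q) -> (G K, G^{-T} Q)
# preserves the product K^T Q, hence the function.
G = np.eye(a) + 0.1 * rng.standard_normal((a, a))  # small perturbation of I, invertible
gl_invariant = np.allclose(
    lightning_attention(X, np.linalg.inv(G).T @ Q, G @ K, V), Y
)

print(is_cubic, rescale_invariant, gl_invariant)
```

Distinct parameter settings that yield the same function lie in the same fiber of the parametrization. The two symmetries checked here are elementary examples of such redundancy; the sketch does not attempt to reproduce the paper's full description of the generic fibers.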
Peer Reviews
Decision: ICLR 2025 Poster
S1. The paper exposition is quite clear, striking a balance between the lingo of algebraic geometry and the context of the ICLR audience. S2. The empirical popularity of attention-based networks, and the relative lack of understanding on the properties of said architectures (with respect to fully connected and convolutional ones, as the authors point out in L51-53) sufficiently justifies the theoretical exploration presented in this submission. S3. The authors identify lightning self-attention
While S2 can warrant sufficient interest from the ICLR community, I think the submission falls short of adequately examining the implications of its findings. I think the authors could have expanded more on the “so what?” question after reaching their (interesting!) core theoretical goals. W1. For example, in L416-419, the authors are quite descriptive about the dimension of the neuromanifold for a contemporary language model being ~8.2B vs the crude parameter dimension of ~8.6B. Yet, there is
The contributions reveal interesting properties of the geometry of lightning attention models and provide a rigorous characterization of the resulting function space. The results are well described and relatively straightforward to follow. As far as I can tell (although I am no expert in algebraic geometry), the claims are clearly stated and the mathematics is correct.
There are several key gaps between the models studied in this paper and practical models:
* The lack of non-linearities in the attention unit (in particular, the lack of softmax).
* The lack of residual connections between layers.
* The lack of element-wise multi-layer perceptrons between layers.
As far as I am aware, all three components are important for practical transformers, and a lack of residual connections or MLPs tends to lead to degeneracy in training. While the manifold analysis i
The paper presents a novel and rigorous theoretical analysis of undoubtedly important topics, i.e., attention mechanisms. The authors use algebraic geometry to develop a solid theoretical understanding of the neuromanifold of neural networks using attention mechanisms. Although the analysis is offered in a simplified context, that is standard and inevitable for a rigorous theoretical analysis and does not reduce the importance of their contribution. I believe the introduction motivates the study
**Key weakness:** The main weakness, present throughout the paper, is that the ultimate goal of the theoretical results and their implications tend to get lost and are poorly explained. I believe many among the ML audience of this paper might raise the objection that “this paper belongs to a math journal, not ICLR”. I take the liberty to clarify that that is not my objection, and I believe even a paper that solely focuses on deriving results and developing mathematical to
Taxonomy
Topics: Fire Detection and Safety Systems · Advanced Decision-Making Techniques · Impact of Light on Environment and Health
Methods: Softmax · Attention Is All You Need
