Geometry of Lightning Self-Attention: Identifiability and Dimension

Nathan W. Henry; Giovanni Luca Marchetti; Kathlén Kohn

arXiv:2408.17221 · cs.LG · February 20, 2025

TL;DR

This paper studies the geometry of the function spaces defined by self-attention networks without normalization (lightning self-attention). It uses algebraic geometry to analyze identifiability, dimension, and, for a single layer, singular and boundary points, and it proposes a conjectural extension to normalized models.

Contribution

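It provides a theoretical geometric analysis of unnormalized self-attention networks: a description of the generic fibers of the parametrization (identifiability) and the dimension of the function space for any number of layers, a characterization of singular and boundary points for a single layer, and a conjectural extension to normalized models.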

Findings

01

Characterized the generic fibers of the parametrization for deep attention networks.

02

Computed the dimension of the function space for an arbitrary number of layers.

03

Formulated a conjectural extension to normalized self-attention networks, proved it for a single layer, and numerically verified it in the deep case.

Abstract

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
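To make the terms "polynomial" and "fibers of the parametrization" concrete, here is a minimal sketch of a single unnormalized attention layer in plain Python. It assumes a common convention that is not taken from the paper: tokens are the columns of `X`, and the layer computes `V X (X^T K^T Q X)` with no softmax. The function and variable names (`lightning_attention`, `Q`, `K`, `V`) are hypothetical. The sketch checks numerically that the map is cubic in the input and that two simple reparametrizations leave the function unchanged. These are only illustrative symmetries, not the paper's full description of the generic fibers.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def scale(a, c):
    """Multiply every entry of a matrix by the scalar c."""
    return [[c * x for x in row] for row in a]


def lightning_attention(X, Q, K, V):
    """Unnormalized self-attention (assumed convention, not the paper's code).

    X: d x t input, one token per column.
    Q, K: a x d query and key matrices; V: d_out x d value matrix.
    """
    A = matmul(transpose(K), Q)                  # d x d, only K^T Q matters
    scores = matmul(matmul(transpose(X), A), X)  # t x t, no softmax applied
    return matmul(matmul(V, X), scores)          # d_out x t


def rand_matrix(rows, cols, rng):
    """Random matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
d, t, a, d_out = 3, 4, 2, 3
X = rand_matrix(d, t, rng)
Q = rand_matrix(a, d, rng)
K = rand_matrix(a, d, rng)
V = rand_matrix(d_out, d, rng)
Y = lightning_attention(X, Q, K, V)

# The layer is homogeneous of degree 3 in X: doubling X multiplies the output by 8.
cubic_gap = max_abs_diff(lightning_attention(scale(X, 2.0), Q, K, V), scale(Y, 8.0))

# Rescaling symmetry: (Q, V) -> (c Q, V / c) defines the same function.
c = 3.0
rescale_gap = max_abs_diff(
    lightning_attention(X, scale(Q, c), K, scale(V, 1.0 / c)), Y
)

# Change of basis in the query/key space: (Q, K) -> (C Q, C^{-T} K) keeps K^T Q fixed.
# A diagonal C is used here so that the inverse is trivial to write down.
diag = [2.0, -0.5]
Q2 = [[diag[i] * x for x in row] for i, row in enumerate(Q)]
K2 = [[x / diag[i] for x in row] for i, row in enumerate(K)]
basis_gap = max_abs_diff(lightning_attention(X, Q2, K2, V), Y)

print(cubic_gap, rescale_gap, basis_gap)
```

All three printed gaps should be at the level of floating-point round-off. Because distinct parameter settings produce the same function, the dimension of the function space is smaller than the raw parameter count, which is the kind of gap the identifiability analysis quantifies.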

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

S1. The paper exposition is quite clear, striking a balance between the lingo of algebraic geometry and the context of the ICLR audience.
S2. The empirical popularity of attention-based networks, and the relative lack of understanding on the properties of said architectures (with respect to fully connected and convolutional ones, as the authors point out in L51-53) sufficiently justifies the theoretical exploration presented in this submission.
S3. The authors identify lightning self-attention

Weaknesses

While S2 can warrant sufficient interest from the ICLR community, I think the submission falls short of adequately examining the implications of its findings. I think the authors could have expanded more on the “so what?” question after reaching their (interesting!) core theoretical goals. W1. For example, in L416-419, the authors are quite descriptive about the dimension of the neuromanifold for a contemporary language model being ~8.2B vs the crude parameter dimension of ~8.6B. Yet, there is

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The contributions reveal interesting properties of the geometry of lightning attention models and provide a rigorous characterization of the resulting functional space. The results are well-described and relatively straightforward to follow. As far as I can tell (although I am no expert in algebraic geometry), the claims are clearly stated and the mathematics is correct.

Weaknesses

There are several key gaps between the models studied in this paper and practical models:
* The lack of non-linearities in the attention unit (in particular, the lack of softmax).
* The lack of residual connections between layers.
* The lack of element-wise multi-layer perceptrons between layers.
As far as I am aware, all three components are important for practical transformers, and a lack of residual connections or MLPs tends to lead to degeneracy in training. While the manifold analysis i

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper presents a novel and rigorous theoretical analysis of undoubtedly important topics, i.e., attention mechanisms. The authors use algebraic geometry to develop a solid theoretical understanding of the neuromanifold of neural networks using attention mechanisms. Although the analysis is offered in a simplified context, that is standard and inevitable for a rigorous theoretical analysis and does not reduce the importance of their contribution. I believe the introduction motivates the study

Weaknesses

**Key weakness:** The main weakness of the paper which is present throughout the paper is that the ultimate goal of the theoretical results as well as their implications tend to seem lost and poorly explained. I believe many among the ML audience of this paper might raise the objection that “this paper belongs to a math journal, not ICLR”. I take the liberty to clarify that that is not my objection, and I believe even a paper that solely focuses on deriving results and developing mathematical to

Code & Models

Repositories

giovanni-marchetti/NeuroDim
PyTorch · Official

Videos

Geometry of Lightning Self-Attention: Identifiability and Dimension · slideslive

Taxonomy

Topics: Fire Detection and Safety Systems · Advanced Decision-Making Techniques · Impact of Light on Environment and Health

Methods: Softmax · Attention Is All You Need