Attention layers provably solve single-location regression
Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer

TL;DR
This paper introduces the single-location regression task to analyze attention mechanisms, demonstrating their ability to handle sparse token information and internal linear structures through theoretical analysis of a dedicated predictor.
Contribution
It provides a theoretical framework for understanding how attention layers can solve sparse, token-wise regression tasks, including asymptotic optimality and training dynamics analysis.
Findings
The predictor is asymptotically Bayes optimal.
Attention layers can recover latent token positions.
The training dynamics effectively learn the underlying structure.
Abstract
Attention-based models, such as Transformers, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and…
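The task and predictor described in the abstract can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the Gaussian token distribution, the mean shift of size `shift` along `k_star` that marks the informative token, the noise level, and the scale `lam` are illustrative choices rather than the paper's exact scaling, and the function names are hypothetical. The erf weighting follows the reviewers' remark that erf is used as the nonlinear weight function in place of softmax.

```python
import math
import numpy as np

# Elementwise erf without a SciPy dependency.
_erf = np.vectorize(math.erf, otypes=[float])


def sample_single_location(n, L, d, k_star, v_star, shift=20.0, noise=0.1, rng=None):
    """Draw n sequences of L tokens in R^d; only the token at position J0 drives Y.

    Assumption (illustrative): tokens are standard Gaussian, and the informative
    token is shifted along k_star so that a linear projection reveals its position.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, L, d))      # zero-mean tokens
    J0 = rng.integers(0, L, size=n)         # latent informative position
    rows = np.arange(n)
    X[rows, J0] += shift * k_star           # mark the informative token
    Y = X[rows, J0] @ v_star + noise * rng.standard_normal(n)
    return X, Y, J0


def erf_attention_predict(X, k, v, lam=0.1):
    """Simplified attention-style predictor: sum over tokens of erf(lam * <x, k>) * <x, v>."""
    weights = _erf(lam * (X @ k))           # (n, L): near 1 on the shifted token, near 0 elsewhere
    values = X @ v                          # (n, L): candidate outputs per token
    return (weights * values).sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, L, n = 16, 5, 2000
    k_star, v_star = np.eye(d)[0], np.eye(d)[1]   # orthonormal directions (assumption)
    X, Y, J0 = sample_single_location(n, L, d, k_star, v_star, rng=rng)
    pred = erf_attention_predict(X, k_star, v_star)
    print("MSE with oracle (k*, v*):", np.mean((pred - Y) ** 2))
    print("Variance of Y:", np.var(Y))
```

In this sketch a small `lam` keeps the weights of zero-mean tokens close to zero, while the large shift saturates erf on the informative token, so the sum approximately selects the output of the single relevant position.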
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper is well-written and easy to follow. For example, the step-by-step illustration in Section 3 of how the construction connects to the attention mechanism is helpful for understanding. 2. The choice of model is good: using the [CLS] property, which is observed in empirical studies, in (5) is natural and reasonable, and using erf as the nonlinear weight function is also reasonable. In general, the theoretical result is solid. 3. The paper also contains some empirical results showing the convergence
1. The setting is restricted to a single informative token position. Although it focuses on the sparse attention setting, which is already difficult to analyze, it is still far from the real-world case. Besides, the authors have not run experiments on real-world tasks (such as the sentiment task shown in Figure 1) to support some claims in the paper, which may somewhat reduce the impact of the theoretical analysis. But in general it is already good as a theory-centric paper.
The paper is well-written and easy to follow. A novel task, "single-location regression", is introduced to capture token-wise sparsity and to model real-world tasks to some extent. Despite the non-convexity and non-linearity, the paper is able to analyze the training dynamics and show the asymptotic Bayes optimality.
1. The proposed task may be over-simplified and lack generality. For instance, it assumes that the tokens other than $X_{J_0}$ have zero mean and that only one token contains information. 2. The paper shows the connection to a single self-attention layer by using the assumption that $p=1$. Although the low-rank property may hold after the training process, it is a strong assumption to make directly. 3. It is uncommon to use the function $\mathrm{erf}$ to replace the softmax function. To demonstrate
- The paper is extremely well written and very easy and pleasant to read. - The analysis is theoretically strong and rigorous. - More generally, this work promotes an approach that is worth being acknowledged and valued: looking into a simpler problem than the ones practitioners can face, yet relevant, and solving it completely and rigorously. - Additionally, the problem is well connected to practical concerns, with the authors making a convincing case for the significance of their analysis.
In decreasing order of importance: - The present work's approach is not so well connected to the existing literature in line 103: "note that our task shares similarities with single-index models (McCullagh & Nelder, 1983) and mixtures of linear regressions (De Veaux, 1989)". I see the differences between those works and the present one being highlighted in the following sentence, but the exact nature of these similarities is not clear. Could this connection be elaborated? - Minor: In the capt
Taxonomy
Topics: Indoor and Outdoor Localization Technologies · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
