Linear Log-Normal Attention with Unbiased Concentration
Yury Nahshan, Joseph Kampeas, Emir Haleva

TL;DR
This paper introduces Linear Log-Normal Attention, a new self-attention mechanism that mimics the distribution and concentration behavior of the original softmax attention, improving scalability for long sequences while maintaining performance.
Contribution
The paper proposes a novel Linear Log-Normal Attention mechanism that emulates the distribution and concentration of standard attention, enhancing scalability in transformer models.
Findings
Outperforms other linearized attention methods on NLP benchmarks
Maintains distribution and concentration properties similar to those of the original attention
Offers scalable attention mechanism for long sequences
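The scalability claim in the findings above rests on the general idea behind linearized attention: if the attention weights factor as a product of feature maps, the N x N matrix never has to be formed. The following is a minimal NumPy sketch of that generic trick, not the authors' implementation. The function names are hypothetical, and the elementwise exponential feature map is an assumption chosen only for illustration.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map):
    """Kernelized attention computed without forming the N x N matrix.

    Cost is O(N * d * d_v) rather than O(N^2 * d).
    """
    phi_q = feature_map(Q)          # (N, d)
    phi_k = feature_map(K)          # (N, d)
    kv = phi_k.T @ V                # (d, d_v): keys and values summarized once
    z = phi_k.sum(axis=0)           # (d,): normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def explicit_kernel_attention(Q, K, V, feature_map):
    """Same quantity, computed the quadratic way as a reference."""
    A = feature_map(Q) @ feature_map(K).T     # (N, N) unnormalized weights
    P = A / A.sum(axis=-1, keepdims=True)     # rows sum to 1
    return P @ V

# Assumption: an elementwise exponential feature map, for illustration only.
rng = np.random.default_rng(0)
N, d = 16, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

out_linear = linear_attention(Q, K, V, np.exp)
out_explicit = explicit_kernel_attention(Q, K, V, np.exp)
print(np.allclose(out_linear, out_explicit))
```

Both functions compute the same output; the linear version simply reorders the matrix products so that the per-token work does not grow with the sequence length.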
Abstract
Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a…
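The concentration behavior mentioned in the abstract can be explored numerically. The sketch below measures the mean row entropy and the entry variance of a softmax attention matrix at several temperatures. It is an illustration under assumptions, not the paper's measurement procedure: the temperature is applied here as an explicit divisor of the scores (the paper's own definition of $\tau_{sm}$ is not reproduced in this summary), the scores are random Gaussian values, and all function names are hypothetical.

```python
import numpy as np

def softmax_rows(scores, temperature):
    """Row-wise softmax of scores / temperature (numerically stabilized)."""
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def mean_row_entropy(P, eps=1e-12):
    """Average Shannon entropy (in nats) over the rows of P."""
    return float(-(P * np.log(P + eps)).sum(axis=-1).mean())

def entry_variance(P):
    """Variance over all entries of the attention matrix."""
    return float(P.var())

# Assumption: i.i.d. Gaussian scores stand in for real query-key products.
rng = np.random.default_rng(0)
scores = rng.normal(size=(32, 32))
temperatures = [0.5, 1.0, 2.0, 4.0]

entropies = [mean_row_entropy(softmax_rows(scores, t)) for t in temperatures]
variances = [entry_variance(softmax_rows(scores, t)) for t in temperatures]

for t, h, v in zip(temperatures, entropies, variances):
    print(f"temperature={t:.1f}  entropy={h:.4f}  variance={v:.6f}")
```

In this setting, a higher temperature flattens each row toward the uniform distribution, so entropy rises and the spread of the entries shrinks. That matches the trend the reviewers attribute to the paper.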
Peer Reviews
Decision: ICLR 2024 poster
- The paper is very well-written with detailed theoretical justification and many useful intuitive explanations.
- The authors have characterized three important and interesting properties of the SA mechanism: (1) the distribution of the SA matrix $\mathbf{P}^{(SM)}$ can be approximated by a log-normal distribution, (2) the entropy $H(\mathbf{P}^{(SM)})$ is monotonically increasing with temperature, while (3) the variance of the attention matrix $\mathbf{P}^{(SM)}$ is decreasing with temperatur
I do not see any major weaknesses of the paper. There are a few minor weaknesses as follows:
- Some notions are introduced with formulas but without intuitive explanations. For example, why is $\tau_{sm}$ in Eq. (5) called the "temperature" of the SA, and why does it control the level of exploration and exploitation?
- In Eq. (6), $P_{ij}^{(SM)}$ is used, while in Eq. (7), it is $\mathbf{P}_{ij}^{(SM)}$.
- I am not sure where the proof of Theorem 3.4 is in the Appendix. Is it Theorem A.3?
- This paper made contributions in 1) in-depth analysis and modeling of distributional and concentration properties of softmax attention; 2) Design of LLN Attention method that emulates softmax attention based on this analysis; 3) Introduction of moment matching technique to align concentration behavior; 4) Linear complexity in sequence length while maintaining softmax attention performance. The proposed LLN Attention offers an interesting and somewhat promising approach to enhance transformer s
- The theoretical analysis relies on several approximations, such as using the Fenton theorem to model log-normal sums. While justified, evaluating the accuracy of these approximations on empirical data could be beneficial.
- The paper focuses exclusively on natural language tasks. Assessing the effectiveness of LLN Attention on other modalities like computer vision with ViTs could provide more insight into its general applicability.
- Only accuracy results are reported. Including other metrics
The analysis of this paper is good and the design of linearized attention is interesting. The experiments and the visualized results are easy to follow.
1. The main concern of this paper is its results. Compared with Nystromformer, which was published two years ago, this paper does not improve enough in either results or efficiency, as shown in Table 1 and Table 2.
2. Only NLP tasks are adopted, while computer vision tasks are neglected.
3. Some related works are missing, such as KVT (ECCV 2022).
4. The experiments are not extensive.
Taxonomy
Topics: Brain Tumor Detection and Classification · Neural Networks and Applications · Advanced Neural Network Applications
