Generalized Probabilistic Attention Mechanism in Transformers
DongNyeong Heo, Heeyoul Choi

TL;DR
This paper introduces a novel attention mechanism called generalized probabilistic attention (GPAM) for Transformers, which effectively mitigates rank-collapse and gradient vanishing issues, improving performance in NLP tasks.
Contribution
The paper proposes GPAM and its dual-attention implementation, providing theoretical analysis and empirical validation that it overcomes key limitations of conventional attention mechanisms.
Findings
daGPAM mitigates rank-collapse and gradient vanishing
Empirical results show superior performance in NLP tasks
Theoretical analysis supports effectiveness of GPAM
Abstract
The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the…
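The property described in the abstract (attention scores that may be negative while each row still sums to a fixed total) can be illustrated with a small sketch. The construction below is an assumption made for illustration, not necessarily the paper's exact formulation: it takes an affine combination of two ordinary softmax attention maps, `(1 + lam) * a_pos - lam * a_neg`, so each row sums to `(1 + lam) - lam = 1` while individual entries can drop as low as `-lam`. The names `dual_attention_scores`, `k_pos`, `k_neg`, and `lam` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_attention_scores(q, k_pos, k_neg, lam=0.5):
    """Hypothetical sketch: affine combination of two softmax attention maps.

    Each softmax row sums to 1, so every row of the result sums to
    (1 + lam) - lam = 1, while single entries may be negative (>= -lam).
    """
    d = q.shape[-1]
    a_pos = softmax(q @ k_pos.T / np.sqrt(d))  # first attention matrix
    a_neg = softmax(q @ k_neg.T / np.sqrt(d))  # second attention matrix
    return (1.0 + lam) * a_pos - lam * a_neg


# Random queries and two separate key projections (illustrative data only).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k_pos = rng.normal(size=(4, 8))
k_neg = rng.normal(size=(4, 8))
scores = dual_attention_scores(q, k_pos, k_neg, lam=0.5)
row_sums = scores.sum(axis=1)  # each entry is 1 up to floating-point error
```

Note that two attention matrices are computed instead of one, which is the extra cost the reviewers point out below.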
Peer Reviews
Decision · Submitted to ICLR 2025
The method and presentation are good. The authors give intuitive, theoretical, and experimental justification of the ideas. The experimental results improve over the baseline.
The range of evaluations is pretty limited and extra computation is needed. A benchmark against just increasing the number of attention heads would be useful.
- The paper includes theory on how daGPAM can reduce rank collapse and vanishing gradients. It does so by showing the residual norm of daGPAM and also its gradient is greater than the respective values for vanilla attention (the original values for attention having been computed by [Dong et al.](https://proceedings.mlr.press/v139/dong21a.html)).
- The paper is well written and easy to follow, and contextualizes relevant work well.
- I think the downstream experiments for this paper could be fleshed out more. I appreciate that some sort of comparison with alternative attention mechanisms like CoDA is present, but I think only having these comparisons on PTB is insufficient. So many attention variants have been proposed over the years, and none have really taken hold, so I think the empirical bar should be high for current and future variants. Even though this paper shows evidence of reducing rank collapse, rank collapse is
1. The writing is easy to follow.
2. The paper shows quite interesting results: gradient vanishing and rank-collapse are hard to solve simultaneously, and maximizing the total norm of gradients can lead to rank-collapse.
3. The authors theoretically show that their method can mitigate both the rank-collapse issue and the gradient vanishing problem.
My main concerns are:
1. The improvement of daGPAM is very marginal and not significant in either language modeling (on Wikitext-103 and Enwiki8, the improvement is around 0.5 PPL) or machine translation (on IWSLT14 and WMT14, the maximum improvement is around 0.7% BLEU score).
2. Computational cost: the dual-attention structure requires computing two attention matrices instead of one, which is very inefficient. The marginal improvement does not justify this performance-efficiency trade-off.
Taxonomy
Topics: Neural Networks and Applications
Methods: Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout
