TL;DR
The paper introduces the Integral Transformer, a self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution, reducing attention noise while improving performance on knowledge and reasoning benchmarks.
Contribution
It proposes a novel denoising self-attention method that reduces attention noise while preserving the contributions of special tokens that are critical for model performance, and that outperforms vanilla, Cog, and Differential attention.
Findings
Outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks.
Employing vanilla self-attention in the lower layers improves performance.
Effectively balances attention distribution and reduces rank collapse.
Abstract
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower…
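The abstract excerpt does not give the exact formulation, so the following is only a minimal sketch of one plausible reading of "integrating signals sampled from the logit distribution": compute several logit matrices from separate query/key projections, average them, and apply a single softmax. The function name `integral_attention_sketch`, the number of signals `S`, and the projection shapes are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def integral_attention_sketch(x, w_q, w_k, w_v):
    """Hypothetical single-head sketch: average S logit signals, then softmax.

    x:   (n, d)       token representations
    w_q: (S, d, d_k)  one query projection per signal (assumed layout)
    w_k: (S, d, d_k)  one key projection per signal (assumed layout)
    w_v: (d, d_v)     shared value projection (assumed)
    """
    d_k = w_q.shape[-1]
    q = np.einsum("nd,sdk->snk", x, w_q)  # (S, n, d_k)
    k = np.einsum("nd,sdk->snk", x, w_k)  # (S, n, d_k)
    # One scaled dot-product logit matrix per signal: (S, n, n)
    logits = np.einsum("snk,smk->snm", q, k) / np.sqrt(d_k)
    # Assumption: the "integral" is approximated by the mean over signals.
    mean_logits = logits.mean(axis=0)  # (n, n)
    weights = softmax(mean_logits, axis=-1)
    return weights @ (x @ w_v), weights


rng = np.random.default_rng(0)
n, d, d_k, S = 5, 8, 4, 3
x = rng.normal(size=(n, d))
out, weights = integral_attention_sketch(
    x,
    rng.normal(size=(S, d, d_k)),
    rng.normal(size=(S, d, d_k)),
    rng.normal(size=(d, d)),
)
```

Under this reading the attention weights remain a proper probability distribution (non-negative, rows summing to one), in contrast with methods that introduce negative attention scores. With `S = 1` the sketch reduces to vanilla scaled dot-product attention.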