Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Ivan Kobyzev; Abbas Ghaddar; Dingtao Hu; Boxing Chen

arXiv:2508.18387·cs.CL·August 27, 2025

TL;DR

The paper introduces the Integral Transformer, a self-attention mechanism that denoises attention by integrating signals from the logit distribution, improving performance and reducing noise in language models.

Contribution

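It proposes a denoising self-attention method that reduces attention noise while preserving the contributions of special tokens that matter for model performance, and that outperforms vanilla, Cog, and Differential attention on knowledge and reasoning language benchmarks.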

Findings

01 Outperforms vanilla, Cog, and Differential attention variants on knowledge and reasoning language benchmarks.

02 Employing vanilla self-attention in the lower layers improves performance.

03 Balances the attention distribution and reduces rank collapse.

Abstract

Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower…
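The abstract is truncated and does not state the exact formula, so the following is only a rough sketch of the contrast it draws, not the authors' implementation. It assumes that "integrating signals sampled from the logit distribution" can be approximated by averaging several logit samples before a single softmax. That averaging rule and the function names (`integral_attention_sketch`, `make_samples`) are hypothetical. The differential-style variant subtracts one softmax map from another, scaled by an illustrative factor `lam`, which is how negative attention scores can arise.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def differential_attention(logits_a, logits_b, lam=0.5):
    """Differential-style weights: softmax(a) - lam * softmax(b).

    Individual weights can be negative, so a token's contribution
    may be removed entirely or flipped in sign.
    """
    pa, pb = softmax(logits_a), softmax(logits_b)
    return [a - lam * b for a, b in zip(pa, pb)]


def integral_attention_sketch(logit_samples):
    """ASSUMPTION (not from the source): average the logit samples,
    then apply one softmax. Weights stay non-negative and sum to 1,
    so no token is pushed below zero.
    """
    count = len(logit_samples)
    width = len(logit_samples[0])
    mean_logits = [
        sum(sample[i] for sample in logit_samples) / count
        for i in range(width)
    ]
    return softmax(mean_logits)


def make_samples(clean_logits, num_samples, noise_std, seed=0):
    """Hypothetical helper: noisy copies of one row of attention logits."""
    rng = random.Random(seed)
    return [
        [x + rng.gauss(0.0, noise_std) for x in clean_logits]
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    clean = [2.0, 0.5, 0.1, 0.1]  # illustrative logits for four key tokens
    samples = make_samples(clean, num_samples=8, noise_std=1.0)
    print("single noisy sample:", softmax(samples[0]))
    print("averaged-logit sketch:", integral_attention_sketch(samples))
    print("differential-style:", differential_attention(samples[0], samples[1]))
```

Under this assumption, averaging before the softmax damps per-sample noise in the logits while keeping every weight non-negative, whereas the subtraction-based variant can drive individual weights below zero.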
