Delta Attention Residuals

Cheng Luo; Zefan Cai; Junjie Hu

arXiv:2605.18855 · cs.LG · May 20, 2026

TL;DR

This paper introduces Delta Attention Residuals, a cross-layer routing mechanism that attends over layer-wise deltas (the change each sublayer introduces) instead of cumulative hidden states, which yields more selective routing and better performance in deep models.

Contribution

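The paper identifies routing collapse in standard Attention Residuals, where attention weights over redundant cumulative states become low-contrast and close to uniform in deeper layers (max weight $\approx 0.2$). It proposes attending over per-sublayer deltas, $v_i = h_{i+1} - h_i$, to restore selective routing.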

Findings

01 Delta Attention Residuals outperform standard residuals and Attention Residuals across multiple scales.

02 Higher-contrast attention distributions enable more effective layer-wise routing.

03 Fine-tuning pretrained models with Delta Attention Residuals is straightforward and beneficial.

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight $\approx 0.2$), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer ($v_i = h_{i+1} - h_i$), instead of cumulative states. Delta representations…
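
The redundancy argument in the abstract can be illustrated with a small toy. Because the deltas telescope, $h_{i+1} = h_0 + \sum_{j \le i} v_j$, consecutive cumulative states share every term but one, so their keys look alike and softmax weights flatten. The sketch below is not the authors' implementation: the helper names (`route`, `deltas`), the unit-norm keys, the hand-picked query vector, and the choice to treat the embedding $h_0$ as the first delta are all assumptions made for illustration, and the states are constructed so that the deltas are mutually orthogonal.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(query, candidates):
    """Softmax attention of one query over earlier-layer candidates.

    Assumption: keys are the unit-normalized candidates and scores are
    plain dot products. Returns (weights, weighted mix of candidates).
    """
    keys = [normalize(c) for c in candidates]
    weights = softmax([dot(query, k) for k in keys])
    dim = len(candidates[0])
    mixed = [sum(w * c[d] for w, c in zip(weights, candidates))
             for d in range(dim)]
    return weights, mixed

def deltas(states):
    """v_i = h_{i+1} - h_i for consecutive hidden states."""
    return [[b - a for a, b in zip(h, h_next)]
            for h, h_next in zip(states, states[1:])]

# Hypothetical toy states: each sublayer writes into a fresh dimension.
states = [
    [3.0, 0.0, 0.0, 0.0],  # h_0 (embedding)
    [3.0, 1.0, 0.0, 0.0],  # h_1
    [3.0, 1.0, 1.0, 0.0],  # h_2
    [3.0, 1.0, 1.0, 1.0],  # h_3
]
query = [0.0, 0.0, 4.0, 0.0]  # hand-picked: asks for the third dimension

# Standard Attention Residuals: attend over cumulative states.
w_cum, _ = route(query, states)

# Delta variant: attend over h_0 followed by the per-sublayer deltas.
delta_candidates = [states[0]] + deltas(states)
w_delta, _ = route(query, delta_candidates)

print("max weight over cumulative states:", round(max(w_cum), 3))
print("max weight over deltas:           ", round(max(w_delta), 3))
```

In this constructed case the query matches exactly one delta, while two of the cumulative states contain the same component and split the attention mass between them. Real hidden states are not orthogonal by construction, so the size of the gap here says nothing about the paper's measured results.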

Code & Models

Repositories

wdlctc/delta-attention-residuals-code (GitHub)