TL;DR
This paper introduces Delta Attention Residuals, a cross-layer routing mechanism that attends over layer-wise deltas instead of cumulative hidden states, leading to more selective routing and improved performance in deep models.
Contribution
The paper proposes Delta Attention Residuals, which route over the change each sublayer introduces rather than over cumulative hidden states, improving model selectivity and accuracy.
Findings
Delta Attention Residuals outperform standard residuals and Attention Residuals across multiple scales.
Higher-contrast attention distributions enable more effective layer-wise routing.
Fine-tuning pretrained models with Delta Attention Residuals is straightforward and beneficial.
Abstract
Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight 0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer, instead of cumulative states. Delta representations…
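The contrast between routing over cumulative states and routing over deltas can be illustrated with a small toy sketch. This is not the paper's implementation: the abstract excerpt does not give the exact delta formula or the scoring function, so the sketch assumes that a delta is the difference between consecutive cumulative states and that routing uses scaled dot-product softmax attention with a single query vector. All names (`route`, `deltas_from_states`, `h0`, `updates`, `query`) are hypothetical, and the numbers are hand-picked so the effect is visible; they are not evidence for the paper's results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route(query, candidates):
    """Mix candidate vectors with softmax attention weights (assumed scoring rule)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in candidates])
    mixed = [sum(w * c[i] for w, c in zip(weights, candidates))
             for i in range(len(query))]
    return mixed, weights


def deltas_from_states(states):
    """Assumed delta definition: difference between consecutive cumulative states."""
    return [[cur_i - prev_i for prev_i, cur_i in zip(prev, cur)]
            for prev, cur in zip(states, states[1:])]


# Toy residual stream: a large initial state plus small per-sublayer updates.
h0 = [4.0, 4.0, 4.0, 4.0]
updates = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]

# Cumulative states under standard additive residuals: each state contains
# everything written before it, so neighbouring states overlap heavily.
states = [h0]
for u in updates:
    states.append([a + b for a, b in zip(states[-1], u)])

cumulative = states[1:]              # what standard Attention Residuals would attend over
deltas = deltas_from_states(states)  # what the delta variant would attend over

# Hypothetical query that looks for the feature written by the third sublayer.
query = [0.0, 0.0, 6.0, 0.0]

_, w_cumulative = route(query, cumulative)
_, w_delta = route(query, deltas)

print("max weight over cumulative states:", round(max(w_cumulative), 3))
print("max weight over deltas:           ", round(max(w_delta), 3))
```

In this toy setup the third and fourth cumulative states both contain the third sublayer's update, so the query cannot separate them and the weights stay flatter. The deltas isolate each sublayer's contribution, so the same query produces a more peaked distribution.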
