Attention Sinks and Outliers in Attention Residuals
Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

TL;DR
This paper introduces OASIS, a novel technique to mitigate attention sinks and outliers in attention residual architectures, enhancing robustness and quantization stability.
Contribution
OASIS introduces a Softmax1-based null space and couples token-level null evidence to depth routing through an inter-layer null signal, which reduces sink-dominated routing and improves inference robustness in AttnResidual models.
Findings
OASIS reduces maximum infinity norm by 9.26%.
OASIS lowers kurtosis by 2.60%.
OASIS improves perplexity and Pass@1 metrics significantly.
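The two outlier statistics above can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the summary does not say whether the paper aggregates per layer, per token, or over the whole model, and it does not say whether kurtosis is reported as Pearson or excess kurtosis. The function names `max_inf_norm`, `kurtosis`, and `relative_reduction` are hypothetical, and Pearson kurtosis over a flattened activation tensor is an assumption.

```python
import numpy as np

def max_inf_norm(activations):
    """Largest absolute activation value (infinity norm of the flattened tensor)."""
    return float(np.max(np.abs(activations)))

def kurtosis(activations):
    """Pearson kurtosis (fourth standardized moment); a Gaussian gives about 3.

    Assumption: the paper's exact kurtosis convention is not stated in this summary.
    """
    x = np.asarray(activations, dtype=np.float64).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / variance ** 2)

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(128, 64))   # synthetic activations, not paper data
    hidden[0, 0] = 50.0                   # inject a single outlier
    print("max |x|  :", max_inf_norm(hidden))
    print("kurtosis :", kurtosis(hidden))
```

A heavy-tailed activation distribution raises both numbers, which is why they are commonly used as proxies for quantization difficulty: a single large value stretches the quantization range and wastes resolution on the remaining values.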
Abstract
We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and…
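To make the "Softmax1-based null space" concrete, the sketch below implements the commonly used Softmax1 definition, exp(s_i) / (1 + sum_j exp(s_j)), which adds an implicit zero logit so the attention weights may sum to less than one. It illustrates only that normalization. How OASIS turns the leftover mass into an inter-layer null signal for depth routing is not described in this summary and is not modeled here. The names `softmax1` and `null_mass` are hypothetical.

```python
import numpy as np

def softmax1(scores, axis=-1):
    """Softmax with an implicit zero logit: exp(s_i) / (1 + sum_j exp(s_j))."""
    scores = np.asarray(scores, dtype=np.float64)
    # Shift by max(scores, 0) for numerical stability; the null logit is 0.
    shift = np.maximum(scores.max(axis=axis, keepdims=True), 0.0)
    exp_scores = np.exp(scores - shift)
    null_term = np.exp(-shift)  # contribution of the implicit null slot
    return exp_scores / (null_term + exp_scores.sum(axis=axis, keepdims=True))

def null_mass(scores, axis=-1):
    """Probability left on the implicit null slot: 1 - sum of softmax1 weights."""
    return 1.0 - softmax1(scores, axis=axis).sum(axis=axis)

if __name__ == "__main__":
    # When no key matches, most mass goes to the null slot instead of a sink token.
    print(softmax1([-8.0, -9.0, -7.5]), null_mass([-8.0, -9.0, -7.5]))
    # When a key matches strongly, softmax1 behaves almost like standard softmax.
    print(softmax1([6.0, 0.0, -1.0]), null_mass([6.0, 0.0, -1.0]))
```

With standard softmax, a head that has nothing useful to attend to must still distribute a total weight of one, and that excess tends to pile up on a few sink tokens. The extra null slot gives such a head somewhere else to put it.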
