Attention Residuals

Kimi Team: Guangyu Chen; Yu Zhang; Jianlin Su; Weixin Xu; Siyuan Pan; Yaoyu Wang; Yucheng Wang; Guanduo Chen; Bohong Yin; Yutian Chen; Junjie Yan; Ming Wei; Y. Zhang; Fanqing Meng; Chao Hong; Xiaotong Xie; Shaowei Liu; Enzhe Lu; Yunpeng Tai; Yanru Chen; Xin Men; Haiqing Guo; Y. Charles; Haoyu Lu; Lin Sui; Jinguo Zhu; Zaida Zhou; Weiran He; Weixiao Huang; Xinran Xu; Yuzhi Wang; Guokun Lai; Yulun Du; Yuxin Wu; Zhilin Yang; Xinyu Zhou

arXiv:2603.15031 · cs.CL · March 17, 2026

Open Access

TL;DR

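This paper introduces Attention Residuals (AttnRes), which replaces the fixed unit-weight residual accumulation of PreNorm LLMs with softmax attention over preceding layer outputs. The learned, input-dependent weights counter the hidden-state growth and per-layer dilution that come with depth, and improve downstream task performance.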

Contribution

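It proposes AttnRes, which lets each layer aggregate earlier layer outputs selectively and content-dependently so that individual contributions are not diluted, and Block AttnRes, which attends over block-level representations to cut the memory and communication overhead of large-scale training.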

Findings

01. AttnRes improves uniformity of output magnitudes and gradients.
02. Block AttnRes reduces memory overhead while maintaining performance gains.
03. Pre-training with AttnRes enhances downstream task performance.

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation…
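The contrast the abstract draws between fixed unit-weight accumulation and softmax attention over preceding layer outputs can be illustrated with a toy sketch. This is not the authors' implementation: the per-layer `query` vector, the scaled dot-product scoring, and the choice of a block's summed outputs as its block-level representation are all assumptions made here for illustration, since the summary above does not specify them.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_residual(outputs):
    # Fixed accumulation: every preceding output is added with weight 1.
    dim = len(outputs[0])
    return [sum(o[i] for o in outputs) for i in range(dim)]

def attn_residual(outputs, query):
    # ASSUMPTION: scores are scaled dot products between a learned query
    # and each preceding output; the paper's exact scoring may differ.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, o) / scale for o in outputs])
    dim = len(outputs[0])
    mixed = [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(dim)]
    return mixed, weights

def block_summaries(outputs, block_size):
    # ASSUMPTION (hypothetical): a block is represented by the sum of its
    # layer outputs, so attention sees one vector per block.
    blocks = [outputs[i:i + block_size] for i in range(0, len(outputs), block_size)]
    return [standard_residual(b) for b in blocks]

# Toy values: four preceding layer outputs of width 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]

summed = standard_residual(outputs)             # grows with every added layer
mixed, weights = attn_residual(outputs, query)  # convex combination of outputs
blocks = block_summaries(outputs, 2)            # 2 vectors to attend over, not 4
block_mixed, block_weights = attn_residual(blocks, query)
```

Because the softmax weights are non-negative and sum to one, `mixed` is a convex combination of the preceding outputs, so its magnitude cannot exceed that of the largest output, whereas `summed` keeps growing as layers are added. The block variant attends over `len(outputs) / block_size` vectors instead of `len(outputs)`, which is the source of the memory saving in this sketch.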

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning