Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team: Yu Zhang; Zongyu Lin; Xingcheng Yao; Jiaxi Hu; Fanqing Meng; Chengyin Liu; Xin Men; Songlin Yang; Zhiyuan Li; Wentao Li; Enzhe Lu; Weizhou Liu; Yanru Chen; Weixin Xu; Longhui Yu; Yejie Wang; Yu Fan; Longguang Zhong; Enming Yuan; Dehao Zhang; Yizhi Zhang; T.Y. Liu; Haiming Wang; Shengjun Fang; Weiran He; Shaowei Liu; Yiwei Li; Jianlin Su; Jiezhong Qiu; Bo Pang; Junjie Yan; Zhejun Jiang; Weixiao Huang; Bohong Yin; Jiacheng You; Chu Wei; Zhengtao Wang; Chao Hong; Yutian Chen; Guanduo Chen; Yucheng Wang; Huabin Zheng; Feng Wang; Yibo Liu; Mengnan Dong; Zheng Zhang; Siyuan Pan; Wenhao Wu; Yuhao Wu; Longyu Guan; Jiawen Tao; Guohong Fu; Xinran Xu; Yuzhi Wang; Guokun Lai; Yuxin Wu; Xinyu Zhou; Zhilin Yang; Yulun Du

arXiv:2510.26692 · cs.CL · November 4, 2025

TL;DR

Kimi Linear is a hybrid linear attention architecture that outperforms full attention under fair comparisons in short-context, long-context, and reinforcement learning (RL) scaling regimes, while reducing KV cache usage by up to 75% and reaching up to 6x decoding throughput for 1M context.

Contribution

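The paper presents Kimi Linear, a hybrid architecture built around Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism. A bespoke chunkwise algorithm, based on a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, substantially reduces computation compared to the general DPLR formulation.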

Findings

01  Outperforms full attention across tasks with similar training recipes.

02  Reduces KV cache usage by up to 75%.

03  Achieves up to 6x decoding throughput for 1M context.
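
As a rough sanity check on the KV cache finding, the arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the paper's accounting: it assumes linear attention layers keep only a fixed-size recurrent state (no per-token cache), that every full attention layer stores the same amount of cache per token, and that the layer ratio is a free parameter. The function names and the ratio of 3 used in the example are hypothetical.

```python
def kv_cache_fraction(linear_layers_per_full_layer: int) -> float:
    """Fraction of a full-attention KV cache that a layerwise hybrid retains.

    Assumptions (hypothetical, for illustration only):
      * linear attention layers keep a fixed-size state, so they contribute
        nothing that grows with sequence length;
      * each full attention layer stores the same cache size per token.
    """
    if linear_layers_per_full_layer < 0:
        raise ValueError("ratio must be non-negative")
    # One out of every (ratio + 1) layers still keeps a per-token cache.
    return 1.0 / (linear_layers_per_full_layer + 1)


def kv_cache_reduction(linear_layers_per_full_layer: int) -> float:
    """Relative saving in per-token cache versus an all-full-attention stack."""
    return 1.0 - kv_cache_fraction(linear_layers_per_full_layer)


# Hypothetical example: three linear layers for every full attention layer.
reduction = kv_cache_reduction(3)
print(f"per-token KV cache reduction: {reduction:.0%}")
```

Under these assumptions the saving approaches the stated bound only as the context grows, because the fixed-size state of the linear layers becomes negligible next to the per-token cache; that is consistent with the "up to" wording.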

Abstract

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and…
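
To make the idea of a delta-rule update with finer-grained gating more concrete, here is a minimal single-step sketch in plain Python. The abstract does not spell out the KDA recurrence, so the exact form below is an assumption: it takes the classical delta rule (write the error between the target value and the value currently stored under the key) and applies a per-channel decay vector `alpha` to the state before the write, where a scalar-gated variant would use one decay value for all channels. The function name `gated_delta_step` and all argument names are hypothetical, and the sketch is a naive recurrent form, not the chunkwise algorithm the paper describes.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def gated_delta_step(
    state: Matrix,       # d_k x d_v associative memory
    q: List[float],      # query, length d_k
    k: List[float],      # key, length d_k (assumed unit-norm here)
    v: List[float],      # value, length d_v
    alpha: List[float],  # per-channel decay in [0, 1], length d_k (assumption)
    beta: float,         # write strength in [0, 1]
) -> Tuple[Matrix, List[float]]:
    d_k, d_v = len(k), len(v)
    # 1. Decay each key channel (row) of the memory by its own gate.
    decayed = [[alpha[i] * state[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2. Read what the decayed memory currently predicts for this key.
    predicted = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3. Delta rule: write only the prediction error, scaled by beta.
    new_state = [
        [decayed[i][j] + beta * k[i] * (v[j] - predicted[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # 4. Read out with the query.
    output = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, output


state = [[0.0, 0.0], [0.0, 0.0]]
# Step 1: store v=[2, 3] under key channel 0 with no decay.
state, out1 = gated_delta_step(
    state, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0], alpha=[1.0, 1.0], beta=1.0
)
# Step 2: halve channel 0 only, while writing v=[1, 1] under key channel 1.
state, out2 = gated_delta_step(
    state, q=[1.0, 0.0], k=[0.0, 1.0], v=[1.0, 1.0], alpha=[0.5, 1.0], beta=1.0
)
print(out1, out2)
```

In the second step the query still targets channel 0, so the readout is the earlier value scaled by that channel's decay, while the new association in channel 1 is stored untouched. With a single scalar gate, both channels would have to forget at the same rate.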
