Kwai Summary Attention Technical Report

Chenglong Chu; Guorui Zhou; Guowang Zhang; Han Li; Hao Peng; Hongtao Cheng; Jian Liang; Jiangxia Cao; Kun Gai; Lingzhi Zhou; Lu Ren; Qi Zhang; Ruiming Tang; Ruitao Wang; Xinchen Luo; Yi Su; Zhiyuan Liang; Ziqi Wang; Boyang Ding; Chengru Song; Dunju Zang; Hui Wang; Jiao Ou; Jiaxin Deng; Jijun Shi; Jinghao Zhang; Junmin Chen; Lejian Ren; Minxuan Lv; Qianqian Wang; Qigen Hu; Shiyao Wang; Siyang Mao; Tao Wang; Xingmei Wang; Zhixin Ling; Ziming Li; Zixing Zhang

arXiv:2604.24432·cs.CL·April 28, 2026

TL;DR

The paper introduces Kwai Summary Attention, a novel mechanism that compresses long sequence contexts into summary tokens to improve efficiency in large language models.

Contribution

The paper proposes a new attention method that balances memory use against long-context modeling through semantic-level compression, filling a gap between existing techniques.

Findings

1. Reduces sequence modeling cost to O(n/k) with compression ratio k.
2. Maintains interpretability and referential long-distance dependencies.
3. Offers a trade-off between memory use and long-context effectiveness.

Abstract

Long-context ability has become one of the most important directions for next-generation Large Language Models, particularly in semantic understanding and reasoning, agentic code intelligence, and recommendation systems. However, standard softmax attention has quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, and the training and inference costs of extremely long sequences grow rapidly. Existing solutions mitigate this issue along two technical routes: i) reducing the KV cache per layer, for example through head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache still depends linearly on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA), linear kernel…
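To make the cache arithmetic concrete, the following is a minimal sketch of how KV cache size scales under the options the abstract mentions. It uses the common estimate of 2 x layers x KV heads x head dimension x cached positions x bytes per element. The function name `kv_cache_bytes`, the model configuration, and the choice k = 8 are illustrative assumptions, not values from the paper, and the sketch ignores any uncompressed recent tokens that a real summary-based design might also keep.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, compression_ratio=1):
    """Approximate KV cache size in bytes for a single sequence.

    compression_ratio plays the role of k: only ceil(seq_len / k)
    positions are cached. k = 1 means no sequence-level compression.
    """
    cached_positions = -(-seq_len // compression_ratio)  # ceiling division
    # The factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * cached_positions * bytes_per_elem)


# Hypothetical configuration (assumed for illustration, not from the paper).
N = 131072                       # sequence length n
LAYERS, HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128

mha = kv_cache_bytes(N, LAYERS, HEADS, HEAD_DIM)        # one KV head per query head
gqa = kv_cache_bytes(N, LAYERS, KV_HEADS, HEAD_DIM)     # head-level compression
summary = kv_cache_bytes(N, LAYERS, KV_HEADS, HEAD_DIM,
                         compression_ratio=8)           # sequence-level compression, k = 8

GIB = 2**30
print(f"MHA:           {mha / GIB:.0f} GiB")      # 64 GiB
print(f"GQA:           {gqa / GIB:.0f} GiB")      # 16 GiB
print(f"GQA + k=8:     {summary / GIB:.0f} GiB")  # 2 GiB
```

GQA lowers the constant factor, but doubling the sequence length still doubles the cache. Only the compression ratio changes how many positions are stored, which is the O(n/k) behaviour listed in the findings.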

Code & Models

Models

🤗 OpenOneRec/KSA-4B-base
model · 24 dl · ♡ 1
