MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team: Wenhao An; Yingfa Chen; Yewei Fang; Jiayi Li; Xin Li; Yaohui Li; Yishan Li; Yuxuan Li; Biyuan Lin; Chuan Liu; Hezi Liu; Siyuan Liu; Hongya Lyu; Yinxu Pan; Shixin Ren; Xingyu Shen; Zhou Su; Haojun Sun; Yangang Sun; Zhen Leng Thai; Xin Tian; Rui Wang; Xiaorong Wang; Yudong Wang; Bo Wu; Xiaoyue Xu; Dong Xu; Shuaikang Xue; Jiawei Yang; Bowen Zhang; Jinqian Zhang; Letian Zhang; Shengnan Zhang; Xinyu Zhang; Xinyuan Zhang; Zhu Zhang; Hengyu Zhao; Jiacheng Zhao; Zhi Zheng; Jie Zhou; Zihan Zhou; Shuo Wang; Chaojun Xiao; Xu Han; Zhiyuan Liu; Maosong Sun

arXiv:2602.11761 · cs.CL · March 3, 2026

Open Access

TL;DR

MiniCPM-SALA is a 9B-parameter hybrid long-context model that combines sparse and linear attention. It reaches up to 3.5x inference speed at 256K tokens, supports contexts of up to 1 million tokens, and cuts training costs by approximately 75%.

Contribution

The paper introduces MiniCPM-SALA, a hybrid attention architecture that combines sparse attention (InfLLM-V2) and linear attention (Lightning Attention) in a 1:3 ratio for ultra-long context modeling, along with a cost-effective continual training framework.

Findings

01. Achieves up to 3.5x inference speed at 256K tokens

02. Supports context lengths up to 1 million tokens

03. Reduces training costs by approximately 75%

Abstract

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained…
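
To make the 1:3 ratio concrete, here is a minimal Python sketch of a layer layout in which one layer in four uses sparse attention and the rest use linear attention. The 32-layer depth, the function name `hybrid_layout`, and the evenly spaced placement are all assumptions for illustration; the abstract says the actual placement comes from a layer selection algorithm, which is not described in the text above.

```python
from collections import Counter


def hybrid_layout(num_layers: int, sparse_every: int = 4) -> list[str]:
    """Label each layer as 'sparse' or 'linear'.

    Hypothetical placement: every `sparse_every`-th layer is sparse,
    giving a 1:(sparse_every - 1) sparse-to-linear ratio. This uniform
    spacing is a stand-in, not the paper's layer selection algorithm.
    """
    return [
        "sparse" if (i + 1) % sparse_every == 0 else "linear"
        for i in range(num_layers)
    ]


# Assumed depth of 32 layers, chosen only so the ratio divides evenly.
layout = hybrid_layout(32)
counts = Counter(layout)

# Fraction of layers that use sparse attention under this layout.
sparse_fraction = counts["sparse"] / len(layout)

print(layout[:8])
print(counts["sparse"], counts["linear"], sparse_fraction)
```

Under this assumed layout, a quarter of the layers use sparse attention. Since linear attention in general keeps a fixed-size recurrent state rather than a per-token cache, such a split is one plausible source of the memory savings at long context lengths, though the paper's own accounting may differ.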

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Topic Modeling