Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun

TL;DR
This paper introduces the Ring-linear model series, a hybrid attention architecture that significantly reduces inference cost for long-context reasoning while maintaining state-of-the-art performance, and improves training efficiency using self-developed high-performance training tools.
Contribution
The paper presents a novel hybrid attention architecture combining linear and softmax attention, optimized for long-context inference, with systematic exploration of attention ratios and high-performance training tools.
Findings
- Reduces inference cost to 1/10 that of a 32-billion-parameter dense model
- Cuts inference cost by over 50% compared with the original Ring series
- Maintains state-of-the-art performance on complex reasoning benchmarks
Abstract
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance…
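The abstract attributes the savings to mixing linear attention with softmax attention. As a rough illustration only (not the authors' implementation), the sketch below shows why a linear-attention layer needs only a fixed-size state at decode time: the unnormalized causal form can be computed either by revisiting all previous tokens or by updating a running d_k x d_v state, with identical results. The function names, the unnormalized kernel, the head dimensions, and the layer split used in `cache_entries` are all assumptions made for illustration; this summary does not state the report's linear-attention variant or its chosen ratio.

```python
# Illustrative sketch: unnormalized causal linear attention for one head.
# Everything here (names, shapes, layer counts) is hypothetical and is not
# taken from the Ring-linear report.


def causal_parallel(q, k, v):
    """Quadratic form: out[t] = sum over s <= t of (q[t] . k[s]) * v[s]."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        row = [0.0] * d_v
        for s in range(t + 1):  # revisit every earlier token
            score = sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                row[j] += score * v[s][j]
        out.append(row)
    return out


def causal_recurrent(q, k, v):
    """Recurrent form: keep a d_k x d_v state, add outer(k[t], v[t]) each step."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        # out[t] = q[t] @ state; no earlier tokens are touched
        out.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return out


def cache_entries(n_tokens, d_k, d_v, n_linear_layers, n_softmax_layers):
    """Per-head decode cache size (in stored numbers) for a hybrid stack.

    Softmax layers store one key and one value per token; linear layers
    store a single d_k x d_v state regardless of context length.
    """
    softmax_part = n_softmax_layers * n_tokens * (d_k + d_v)
    linear_part = n_linear_layers * d_k * d_v
    return softmax_part + linear_part
```

Calling `cache_entries` with the same total layer count but a different linear/softmax split gives a feel for how the cache that must be read per decoded token shrinks as more layers become linear, which is the I/O effect the abstract describes for long contexts.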
Code & Models
- 🤗 inclusionAI/Ring-mini-linear-2.0 · model · 721 dl · ♡ 88
- 🤗 inclusionAI/Ring-flash-linear-2.0 · model · 66 dl · ♡ 98
- 🤗 inclusionAI/Ring-mini-linear-2.0-GPTQ-int4 · model · 37 dl · ♡ 10
- 🤗 inclusionAI/Ring-flash-linear-2.0-GPTQ-int4 · model · 25 dl · ♡ 8
- 🤗 inclusionAI/Ring-flash-linear-2.0-128k · model · 12 dl · ♡ 20
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
