Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Ling Team; Bin Han; Caizhi Tang; Chen Liang; Donghao Zhang; Fan Yuan; Feng Zhu; Jie Gao; Jingyu Hu; Longfei Li; Meng Li; Mingyang Zhang; Peijie Jiang; Peng Jiao; Qian Zhao; Qingyuan Yang; Wenbo Shen; Xinxing Yang; Yalin Zhang; Yankun Ren; Yao Zhao; Yibo Cao; Yixuan Sun; Yue Zhang; Yuchen Fang; Zibin Lin; Zixuan Cheng; Jun Zhou

arXiv:2510.19338 · cs.LG · October 24, 2025

TL;DR

This paper introduces the Ring-linear model series, which uses a hybrid of linear and softmax attention to cut inference cost for long-context reasoning while maintaining state-of-the-art performance on reasoning benchmarks. The authors also report improved training efficiency from self-developed high-performance training tools.

Contribution

The paper presents a novel hybrid attention architecture combining linear and softmax attention, optimized for long-context inference, with systematic exploration of attention ratios and high-performance training tools.
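To make the linear-attention half of the hybrid concrete, here is a minimal pure-Python sketch. It assumes the generic textbook form of causal linear attention (unnormalized, identity feature map), which is not necessarily the variant used in the Ring-linear models; the function names are hypothetical. It shows that the output can be computed either by revisiting every earlier token or by updating a single fixed-size state matrix per token, and that the two forms agree.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """o_t = sum over s <= t of (q_t . k_s) * v_s, revisiting all past tokens."""
    outputs = []
    for t in range(len(q)):
        out = [0.0] * len(v[0])
        for s in range(t + 1):
            weight = dot(q[t], k[s])
            for j in range(len(out)):
                out[j] += weight * v[s][j]
        outputs.append(out)
    return outputs


def linear_attention_recurrent(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Same outputs, but using a running state S_t = S_{t-1} + k_t v_t^T.

    The state has shape (d_k, d_v) regardless of how many tokens were seen,
    so the per-token cost and memory do not grow with sequence length.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Fold the current token into the state (rank-1 update).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        # Read out: o_t = q_t^T S_t.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    # Toy inputs: 3 tokens, key dimension 2, value dimension 3.
    q = [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]]
    k = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.5]]
    v = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
    print(linear_attention_quadratic(q, k, v))
    print(linear_attention_recurrent(q, k, v))
```

Softmax attention has no such fixed-size summary, because the exponential of each query-key score must be recomputed against every cached key. That difference is what a hybrid stack trades on: a few softmax layers keep exact token-to-token lookup, while the linear layers keep decoding cost flat.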

Findings

01  Reduces inference cost to 1/10 that of a 32-billion-parameter dense model

02  Cuts inference cost by over 50% compared to the original Ring series

03  Maintains state-of-the-art performance on complex reasoning benchmarks

Abstract

In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance…
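The I/O saving the abstract describes can be illustrated with a rough cache-size calculation. The sketch below rests on assumptions throughout: the layer counts, head count, head dimension, and the 4-to-28 split between softmax and linear layers are hypothetical values chosen for illustration, not the Ring-linear configuration, and the function names are made up. It relies only on two general facts: a softmax attention layer caches keys and values for every past token, while a linear attention layer keeps a fixed-size state per head.

```python
def softmax_kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                           head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache for softmax layers: keys and values for every past token."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


def linear_state_bytes(n_layers: int, n_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Recurrent state for linear layers: one head_dim x head_dim matrix per
    head, independent of sequence length."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


def hybrid_cache_bytes(seq_len: int, n_softmax_layers: int, n_linear_layers: int,
                       n_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Total decode-time cache for a stack mixing both layer types."""
    return (
        softmax_kv_cache_bytes(seq_len, n_softmax_layers, n_heads, head_dim,
                               bytes_per_value)
        + linear_state_bytes(n_linear_layers, n_heads, head_dim, bytes_per_value)
    )


if __name__ == "__main__":
    # Hypothetical configuration, NOT taken from the paper.
    heads, dim = 8, 128
    for seq_len in (4_096, 32_768, 131_072):
        full = softmax_kv_cache_bytes(seq_len, 32, heads, dim)
        hybrid = hybrid_cache_bytes(seq_len, 4, 28, heads, dim)
        print(f"{seq_len:>7} tokens: all-softmax {full / 2**20:8.1f} MiB, "
              f"hybrid {hybrid / 2**20:8.1f} MiB, ratio {hybrid / full:.3f}")
```

Under these assumed numbers the hybrid cache approaches the fraction of layers that still use softmax attention as the context grows, since the linear-layer state stays constant. This is only a memory estimate and does not reproduce the cost reductions reported in the abstract, which also depend on parameter sparsity and kernel efficiency.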

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics