Scaling Linear Attention with Sparse State Expansion

Yuqi Pan; Yongqi An; Zheng Li; Yuhong Chou; Ruijie Zhu; Xiaohui Wang; Mingxuan Wang; Jinqiao Wang; Guoqi Li

arXiv:2507.16577·cs.LG·October 2, 2025

TL;DR

This paper introduces Sparse State Expansion (SSE), a linear attention method that improves long-context modeling by updating only selected state rows via softmax-based top-$k$ classification and by expanding the contextual state. Its 2B SSE-H model achieves state-of-the-art reasoning performance among small models.

Contribution

The paper proposes SSE, a sparse state expansion technique that strengthens linear attention models on long-context tasks, together with an efficient implementation and improved reasoning results.

Findings

01. SSE improves retrieval and reasoning in language models.

02. The 2B SSE-H model achieves top reasoning scores among small models.

03. SSE scales favorably with increased state size.

Abstract

The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple…
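The row-sparse update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`row_sparse_update`, `read`), the state layout (one row per "class"), and the choice to weight each write by its softmax probability are all assumptions made for illustration. The idea shown is only that each token writes its value into the $k$ highest-scoring state rows and leaves the rest untouched.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def row_sparse_update(state, logits, value, k):
    """Write `value` into only the k state rows with the largest class scores.

    state  : list of rows, each a list of floats (hypothetical layout)
    logits : one class score per state row (assumed to come from the key)
    value  : the token's value vector
    Returns the indices of the rows that were updated.
    """
    probs = softmax(logits)
    chosen = top_k_indices(probs, k)
    for i in chosen:
        for j, v in enumerate(value):
            state[i][j] += probs[i] * v  # assumption: weight by softmax score
    return chosen


def read(state, query):
    """Linear-attention style readout: output = query @ state."""
    d_v = len(state[0])
    return [sum(query[i] * state[i][j] for i in range(len(state)))
            for j in range(d_v)]


# Four state rows, two value dimensions; only two rows receive the update.
state = [[0.0, 0.0] for _ in range(4)]
touched = row_sparse_update(state, logits=[2.0, 0.1, 1.5, -1.0],
                            value=[1.0, -1.0], k=2)
output = read(state, query=[1.0, 0.0, 0.0, 0.0])
```

In a dense linear attention update every row of the state would change on every token. Here rows 1 and 3 stay at zero, which is the sense in which a hard top-$k$ selection can limit interference between rows.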

Taxonomy

Topics: Neural Networks and Applications · Blind Source Separation Techniques · Neural Networks and Reservoir Computing