HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

Yizhao Gao; Jianyu Wei; Qihao Zhang; Yu Cheng; Shimao Chen; Zhengju Tang; Zihan Jiang; Yifan Song; Hailin Zhang; Liang Zhao; Bo Yang; Gang Wang; Shijie Cao; Fuli Luo

arXiv:2602.03560·cs.CL·February 4, 2026

Open Access

TL;DR

HySparse is a hybrid sparse attention architecture in which each full attention layer acts as an oracle for token importance and shares its KV cache with the sparse attention layers that follow it, reducing both computation and memory while outperforming full attention baselines.

Contribution

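HySparse interleaves each full attention layer with several sparse attention layers that take their token selection and KV cache from that full layer. This addresses two limitations of prior sparse attention methods: reliance on additional proxies to predict token importance, and computation savings that come without any KV cache savings.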

Findings

01. Outperforms full attention and baseline models across various settings.

02. Reduces KV cache storage by nearly 10x in large models.

03. Achieves substantial performance gains with fewer full attention layers.
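
The KV cache finding can be read as a counting argument: if only the full attention layers store a KV cache and the sparse layers reuse it, storage shrinks roughly in proportion to the share of full layers. The sketch below illustrates that arithmetic only. The function name and the layer counts are hypothetical placeholders, not figures from the paper, and it assumes sparse layers add no cache of their own.

```python
def kv_cache_reduction(total_layers, full_layers):
    """Ratio of baseline KV storage (every layer caches) to shared storage.

    Assumption: only full attention layers hold a KV cache, and every
    sparse layer reuses the cache of a full layer at no extra cost.
    """
    if not 0 < full_layers <= total_layers:
        raise ValueError("full_layers must be in 1..total_layers")
    return total_layers / full_layers


# Hypothetical configuration chosen for illustration, not taken from the paper.
ratio = kv_cache_reduction(total_layers=40, full_layers=4)
print(f"KV cache reduction: {ratio:.1f}x")
```

Any per-layer state that sparse layers keep for themselves would lower this ratio, so it should be read as an upper bound under the stated assumption.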

Abstract

This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We…
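
The two ideas in the abstract, oracle token selection and KV cache reuse, can be sketched for a single query in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`attend`, `select_top_k`), the top-k selection rule, the per-token granularity, and the toy vectors are all hypothetical, since the abstract does not specify how selection is performed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, indices=None):
    """Single-query scaled dot-product attention over the given positions.

    With indices=None this is full attention; otherwise only the listed
    positions of the same keys/values are read.
    """
    if indices is None:
        indices = list(range(len(keys)))
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, indices))
           for d in range(dim)]
    return out, dict(zip(indices, weights))


def select_top_k(weights_by_index, k):
    """Assumed selection rule: keep the k positions with the largest weights."""
    ranked = sorted(weights_by_index, key=weights_by_index.get, reverse=True)
    return sorted(ranked[:k])


# Toy KV cache produced by the full attention layer (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

# Full attention layer: its attention weights double as the importance oracle.
full_out, full_weights = attend([4.0, 0.0], keys, values)
selected = select_top_k(full_weights, k=2)

# Sparse layer: its own query, but the same keys/values, restricted to the
# oracle-selected positions. No second KV cache is allocated.
sparse_out, sparse_weights = attend([3.0, 1.0], keys, values, indices=selected)
print(selected, sparse_out)
```

The sparse layer here never computes its own importance scores and never writes keys or values, which is the sense in which both computation and memory are reduced in this sketch.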

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies