Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs

Wentao Ni; Kangqi Zhang; Zhongming Yu; Oren Nelson; Mingu Lee; Hong Cai; Fatih Porikli; Jongryool Kim; Zhijian Liu; Jishen Zhao

arXiv:2602.05191·cs.LG·February 6, 2026

TL;DR

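Double-P is a hierarchical top-p sparse attention method for long-context LLM decoding. It first estimates attention mass at the cluster level, then adaptively refines computation with a second top-p stage, reducing attention computation overhead by up to 1.8x and speeding up end-to-end decoding by up to 1.3x with near-zero accuracy drop.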

Contribution

The paper proposes a hierarchical sparse attention framework that jointly optimizes the three stages existing top-p methods handle separately: top-p accuracy, selection overhead, and sparse attention cost.

Findings

01. Achieves up to 1.8x reduction in attention computation overhead.
02. Delivers up to 1.3x speedup in end-to-end decoding.
03. Maintains near-zero accuracy drop across benchmarks.

Abstract

As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention…
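The contrast the abstract draws between fixed-budget top-k and mass-preserving top-p selection can be illustrated with a small sketch. This is a generic single-query top-p selection over raw attention scores, not Double-P's hierarchical procedure. The function names (`top_p_indices`, `top_k_indices`) and the two toy score vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(scores, p):
    """Smallest set of token indices whose attention mass reaches p.

    Hypothetical helper: plain (non-hierarchical) top-p selection.
    Returns the selected indices and the attention mass they cover.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return selected, mass


def top_k_indices(scores, k):
    """Fixed-budget baseline: always keeps exactly k tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Two toy heads (assumed values): one sharply peaked, one nearly uniform.
peaked = [8.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.3, -1.0]
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9]

peaked_sel, peaked_mass = top_p_indices(peaked, 0.9)
flat_sel, flat_mass = top_p_indices(flat, 0.9)

print(len(peaked_sel), len(flat_sel))
print(len(top_k_indices(peaked, 4)), len(top_k_indices(flat, 4)))
```

With p = 0.9, the peaked head keeps a single token while the nearly uniform head keeps all eight, whereas top-k with k = 4 keeps four tokens in both cases regardless of how the mass is distributed. Note that this sketch sorts every token's score, which is exactly the kind of selection overhead a cluster-level first stage would aim to avoid.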

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy