Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference

Dhruv Deshmukh; Saurabh Goyal; Nipun Kwatra; Ramachandran Ramjee

arXiv:2512.16391·cs.LG·December 19, 2025

TL;DR

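Kascade is a training-free sparse attention method for long-context LLM inference. It computes Top-k key indices in a few anchor layers and reuses them in the layers in between, giving up to 4.1x faster decode attention and up to 2.2x faster prefill attention while maintaining accuracy on long-context benchmarks.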

Contribution

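The paper proposes a training-free sparse attention technique built on two known observations: post-softmax attention is intrinsically sparse, and the identity of high-weight keys is stable across nearby layers. Exact Top-k indices are computed only in a small set of anchor layers, which are chosen by dynamic programming, and are reused in the layers in between.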
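The dynamic-programming step can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `select_anchors` and the matrix `sim` are hypothetical names, `sim[a, l]` is assumed to hold a development-set similarity score between the Top-k sets of layers `a` and `l`, layer 0 is assumed to always be an anchor, and each layer is assumed to reuse the nearest anchor at or before it. The objective shown (sum of each layer's similarity to its anchor) is one plausible reading of "maximizes cross-layer similarity"; the exact objective is not given on this page.

```python
import numpy as np

def select_anchors(sim, num_anchors):
    """Pick num_anchors anchor layers maximizing total layer-to-anchor similarity.

    sim[a, l] (hypothetical input): similarity between the Top-k key sets of
    layer a and layer l, e.g. averaged over a development set. Only l >= a is used.
    Assumption: every layer reuses the closest anchor at or before it.
    """
    L = sim.shape[0]
    # seg[a, b]: total similarity when layer a serves layers a .. b-1.
    seg = np.zeros((L, L + 1))
    for a in range(L):
        seg[a, a + 1:] = np.cumsum(sim[a, a:])

    # best[m, b]: best score covering layers 0 .. b-1 with exactly m anchors.
    best = np.full((num_anchors + 1, L + 1), -np.inf)
    prev = np.zeros((num_anchors + 1, L + 1), dtype=int)
    best[0, 0] = 0.0
    for m in range(1, num_anchors + 1):
        for b in range(1, L + 1):
            for a in range(b):  # a is the last anchor, serving layers a .. b-1
                cand = best[m - 1, a] + seg[a, b]
                if cand > best[m, b]:
                    best[m, b] = cand
                    prev[m, b] = a

    # Walk back through the stored choices to recover the anchor layers.
    anchors, b = [], L
    for m in range(num_anchors, 0, -1):
        a = int(prev[m, b])
        anchors.append(a)
        b = a
    return anchors[::-1], float(best[num_anchors, L])
```

For a toy model whose layers fall into two groups with similar Top-k sets, for example layers 0 to 2 and layers 3 to 5, this sketch places one anchor at the start of each group. The triple loop costs O(M·L²) for L layers and M anchors, which is negligible at typical model depths.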

Findings

1) Achieves up to 4.1x speedup in decode attention.
2) Achieves up to 2.2x speedup in prefill attention.
3) Maintains accuracy on long-context benchmarks.

Abstract

Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments…
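The anchor/reuse idea in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head, single-query (decode) illustration under stated assumptions, not the paper's kernels: all function names are hypothetical, the per-layer queries are taken as given rather than produced by a real transformer stack, anchor layers are assumed to attend over their own Top-k keys after scoring all keys, and tile-level and head-aware details are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dense_attention(q, K, V):
    """Full attention for one query q over all keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def topk_indices(scores, k):
    """Indices of the k largest scores (in no particular order)."""
    return np.argpartition(scores, -k)[-k:]

def sparse_attention(q, K, V, idx):
    """Attention restricted to the keys and values selected by idx."""
    scores = K[idx] @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V[idx]

def decode_step(qs, Ks, Vs, anchors, k):
    """One decode step across layers with Top-k index reuse.

    qs, Ks, Vs: per-layer query vector, key matrix, value matrix (hypothetical inputs).
    anchors: set of anchor layer ids; layer 0 must be an anchor.
    """
    outputs, idx = [], None
    for layer, (q, K, V) in enumerate(zip(qs, Ks, Vs)):
        if layer in anchors:
            # Anchor layer: score every key once and keep the exact Top-k.
            scores = K @ q / np.sqrt(q.shape[0])
            idx = topk_indices(scores, k)
        # Reuse layers skip the full scoring pass and attend over the stored indices.
        outputs.append(sparse_attention(q, K, V, idx))
    return outputs
```

In this sketch a reuse layer touches only k keys instead of all n, which is where the savings would come from when k is much smaller than n. Setting k equal to the context length makes every layer reproduce `dense_attention`, which is a convenient sanity check.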

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning