Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee

TL;DR
Kascade is a training-free sparse attention method that speeds up long-context LLM inference by computing exact Top-k key indices in a few anchor layers and reusing them in the layers in between, reducing attention latency while maintaining accuracy.
Contribution
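It proposes a training-free sparse attention technique that exploits two known observations (post-softmax attention is sparse, and the identity of high-weight keys is stable across nearby layers) and selects its anchor layers with dynamic programming, giving efficient, accurate long-context inference in LLMs.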
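The anchor-layer selection can be pictured as a segmentation problem. The sketch below is an illustration only, not the paper's implementation: it assumes a precomputed matrix `sim[a, l]` (a hypothetical input) that scores how well layer `l` can reuse the Top-k indices of anchor layer `a`, assumes each layer reuses its nearest preceding anchor, and assumes the objective is the plain sum of those scores. The function name `select_anchor_layers` is also made up for this example.

```python
import numpy as np

def select_anchor_layers(sim, num_anchors):
    """Pick `num_anchors` anchor layers maximizing total similarity.

    sim[a, l] (for a <= l) is an assumed, precomputed score of how well
    layer l can reuse the Top-k indices computed at anchor layer a.
    Each layer is assumed to reuse its nearest preceding anchor.
    """
    n = sim.shape[0]
    # seg[a, b]: total score when layers a..b-1 all reuse anchor a.
    seg = np.zeros((n, n + 1))
    for a in range(n):
        seg[a, a + 1:] = np.cumsum(sim[a, a:])

    # best[m, b]: best score covering layers 0..b-1 with m anchors.
    best = np.full((num_anchors + 1, n + 1), -np.inf)
    back = np.zeros((num_anchors + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for m in range(1, num_anchors + 1):
        for b in range(1, n + 1):
            for a in range(m - 1, b):
                score = best[m - 1, a] + seg[a, b]
                if score > best[m, b]:
                    best[m, b] = score
                    back[m, b] = a

    # Walk the back-pointers to recover the chosen anchors.
    anchors, b = [], n
    for m in range(num_anchors, 0, -1):
        a = int(back[m, b])
        anchors.append(a)
        b = a
    return anchors[::-1], float(best[num_anchors, n])
```

Under these assumptions the first layer is always an anchor, since nothing precedes it to reuse from. The cost is O(M·L²) for L layers and M anchors, which is negligible next to model inference.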
Findings
Achieves up to 4.1x speedup in decode attention.
Achieves up to 2.2x speedup in prefill attention.
Maintains accuracy on long-context benchmarks.
Abstract
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments…
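The anchor/reuse split described in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head, single-query illustration under stated assumptions, not the paper's kernel: the function names are hypothetical, the anchor layer is assumed to compute dense scores and return its Top-k key indices alongside the output, and the tile-level and head-aware details mentioned in the abstract are left out.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def anchor_attention(q, K, V, k):
    """Anchor layer (assumed form): dense attention for one query,
    plus the indices of the k highest-scoring keys."""
    scores = K @ q / np.sqrt(q.shape[-1])
    topk_idx = np.argpartition(scores, -k)[-k:]  # exact Top-k, unordered
    return softmax(scores) @ V, topk_idx

def reuse_attention(q, K, V, topk_idx):
    """Reuse layer: attend only to the keys selected by an anchor layer."""
    scores = K[topk_idx] @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V[topk_idx]

# Usage: indices from an anchor layer drive a later reuse layer.
rng = np.random.default_rng(0)
n_keys, d = 64, 16
q0, K0, V0 = rng.normal(size=d), rng.normal(size=(n_keys, d)), rng.normal(size=(n_keys, d))
q1, K1, V1 = rng.normal(size=d), rng.normal(size=(n_keys, d)), rng.normal(size=(n_keys, d))
_, idx = anchor_attention(q0, K0, V0, k=8)
out = reuse_attention(q1, K1, V1, idx)  # touches 8 of the 64 keys
```

The reuse layer skips both the full score computation and the Top-k selection, which is where the savings come from when the selected indices stay stable across nearby layers.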
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning
