Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding
Feiyu Yao, Qian Wang

TL;DR
LFPS is a novel method that accelerates sparse indexing in large language model decoding by leveraging historical attention patterns, significantly reducing computation and memory costs while maintaining accuracy.
Contribution
LFPS introduces a dynamic sparse indexing approach that exploits temporal correlations in attention, enabling faster decoding in long-context LLMs while maintaining generation accuracy.
Findings
Achieves up to 22.8× speedup over full attention.
Attains 9.6× speedup over exact Top-k retrieval.
Maintains generation accuracy with significant efficiency gains.
Abstract
As large language models (LLMs) continue to support increasingly longer contexts, the memory demand for key-value (KV) caches during decoding grows rapidly, becoming a critical bottleneck in both GPU memory capacity and PCIe bandwidth. Sparse attention mechanisms alleviate this issue by computing attention weights only for selected key-value pairs. However, their indexing computation typically requires traversing all key vectors, resulting in significant computational and data transfer overhead. To reduce the cost of index retrieval, existing methods often treat each decoding step as an independent process, failing to exploit the temporal correlations embedded in historical decoding information. To this end, we propose LFPS (Learn From the Past for Sparse Indexing), an acceleration method that dynamically constructs sparse indexing candidates based on historical attention patterns. LFPS…
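To make the indexing cost described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows exact Top-k retrieval, which scores every key vector, next to a variant that scores only a candidate set carried over from an earlier decoding step. The function names (`topk_indices`, `sparse_attention`) and the candidate heuristic (reuse the previous step's selection plus the newest token) are assumptions made for illustration; the source does not specify how LFPS builds its candidates.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def topk_indices(query, keys, k, candidates=None):
    """Return the indices of the k highest-scoring keys.

    With candidates=None every key is scored (exact Top-k), so the cost
    grows with the full context length. With a candidate list, only
    those keys are scored.
    """
    pool = range(len(keys)) if candidates is None else candidates
    ranked = sorted(pool, key=lambda i: dot(query, keys[i]), reverse=True)
    return ranked[:k]


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected key-value pairs."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    n, dim, k = 64, 8, 4
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

    # Step t: exact Top-k traverses all n keys.
    q_prev = [rng.gauss(0, 1) for _ in range(dim)]
    prev_selected = topk_indices(q_prev, keys, k)

    # Step t+1: the query drifts slightly. ASSUMED heuristic (not from the
    # paper): score only the previous selection plus the newest token.
    q_next = [x + 0.05 * rng.gauss(0, 1) for x in q_prev]
    candidates = sorted(set(prev_selected) | {n - 1})
    approx_selected = topk_indices(q_next, keys, k, candidates)

    output = sparse_attention(q_next, keys, values, approx_selected)
    print("keys scored:", len(candidates), "of", n)
    print("selected indices:", approx_selected)
```

In this sketch the second step scores only the candidate keys instead of all `n`, which is the kind of saving that history-based candidate construction aims for. How closely a candidate set tracks the exact Top-k result depends on how stable attention patterns are between steps.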
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
