LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi

TL;DR
LazyLLM introduces a dynamic token pruning method that selectively computes key-value caches for important tokens, significantly speeding up long-context LLM inference without sacrificing accuracy.
Contribution
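The paper proposes a dynamic token pruning approach that selectively computes the KV cache only for tokens important to the next-token prediction, and can select a different subset of tokens at each generation step, unlike static methods that prune the prompt once.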
Findings
Accelerates the prefilling stage of the Llama 2 7B model by 2.34x on multi-document question answering
Maintains accuracy while reducing computation time
Seamlessly integrates with existing models without fine-tuning
Abstract
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select…
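The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a prompt token's importance is the attention probability the last prompt token assigns to it, averaged over heads, and that a fixed fraction of tokens is kept at a layer. The function names and the `keep_ratio` parameter are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def last_token_importance(last_queries, keys):
    """Score each token by the attention the last token pays to it.

    last_queries: [num_heads][dim] query vectors of the last token.
    keys:         [num_heads][num_tokens][dim] key vectors.
    Returns num_tokens scores, averaged over heads (an assumed scoring rule).
    """
    num_heads = len(keys)
    num_tokens = len(keys[0])
    dim = len(last_queries[0])
    scores = [0.0] * num_tokens
    for h in range(num_heads):
        logits = [
            sum(q * k for q, k in zip(last_queries[h], keys[h][i])) / math.sqrt(dim)
            for i in range(num_tokens)
        ]
        for i, p in enumerate(softmax(logits)):
            scores[i] += p / num_heads
    return scores


def select_tokens(importance, keep_ratio):
    """Return sorted indices of the tokens to keep at one layer.

    Keeps the top `keep_ratio` fraction by importance. The last token is
    always kept because it produces the next-token prediction.
    """
    n = len(importance)
    num_keep = max(1, math.ceil(keep_ratio * n))
    last = n - 1
    ranked = sorted(range(n - 1), key=lambda i: importance[i], reverse=True)
    return sorted(set(ranked[: num_keep - 1]) | {last})


# Six prompt tokens, keep half of them at this layer.
print(select_tokens([0.05, 0.30, 0.10, 0.25, 0.10, 0.20], keep_ratio=0.5))
# [1, 3, 5]
```

In a dynamic scheme, this selection would be rerun at later layers and at later decoding steps, so a token dropped once can still be picked up again. A static scheme would call it a single time on the whole prompt.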
Peer Reviews
Decision · Submitted to ICLR 2025
* Evaluated the proposed method on diverse tasks.
* The problem does not seem general or universal across contexts. The authors should state clearly when TTFT becomes 21x that of decoding. In a large-scale system, decoding and prefilling happen on different servers, so it is not a big problem. Also, prefilling usually computes more tokens than decoding, so if we normalize the latency by the number of tokens, we cannot say prefilling is performing poorly, although optimizing it helps anyway.
* Figures are confusing, especially Fig. 4.
* Methods are comp
* **Sharp focus on dynamic token pruning for the prefilling stage**. This paper proposes an innovative approach to tackle the TTFT problem by shifting part of the prompt token computation to the decoding stage. The dynamic token pruning at different decoding steps allows for the selective retention of previously pruned but relevant tokens.
* **Effective and flexible layer-wise pruning strategy**. The progressive token pruning from earlier to later layers is well-justified, offering a flexible a
* **Limited detail on hyperparameter settings and implementation strategies**. The approach introduces numerous hyperparameters, particularly with progressive token pruning and token revival, which could impact implementation. Providing additional details on the decision-making process for these hyperparameters would enhance transparency and offer insights into LazyLLM’s effectiveness and generalizability.
  - Top-$k$ percentile selection strategy: Unless I missed something, it appears that di
A key strength of LazyLLM lies in its dynamic, training-free approach to token pruning, which allows it to be easily integrated into existing transformer-based LLMs without requiring model fine-tuning or architectural changes. By selectively computing only the most important tokens for each generation step, LazyLLM not only optimizes the time-to-first-token (TTFT) but also reduces the overall computation during inference. This results in significant speedups across various tasks and model config
1. **Additional GPU memory usage for Aux Cache**: If the Aux Cache is retained on the GPU, it will increase GPU memory consumption. As a result, the actual GPU memory footprint of LazyLLM should account for both the retained KV cache and the Aux Cache. This design might limit LazyLLM’s applicability in scenarios with high memory demands.
2. **Alignment of GPU memory costs in experiments**: It is crucial to clarify whether the GPU memory usage for each method was fairly aligned in the experiment
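
The memory concern raised in this review can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only. It uses the Llama 2 7B shape (32 layers, hidden size 4096) with 16-bit values, and it assumes that the Aux Cache stores one hidden-state vector per pruned token per layer. The pruning schedule is hypothetical and is not taken from the paper.

```python
def kv_cache_bytes(tokens_per_layer, hidden_size, bytes_per_value=2):
    """Bytes for keys plus values, given the cached token count at each layer."""
    return sum(2 * n * hidden_size * bytes_per_value for n in tokens_per_layer)


def aux_cache_bytes(pruned_per_layer, hidden_size, bytes_per_value=2):
    """Bytes for an auxiliary cache of pruned tokens.

    ASSUMPTION: one hidden-state vector is stored per pruned token per layer.
    """
    return sum(n * hidden_size * bytes_per_value for n in pruned_per_layer)


NUM_LAYERS, HIDDEN = 32, 4096   # Llama 2 7B shape
PROMPT_TOKENS = 4096            # example prompt length

# Baseline: every prompt token has keys and values at every layer.
full = kv_cache_bytes([PROMPT_TOKENS] * NUM_LAYERS, HIDDEN)

# Hypothetical schedule: keep all tokens in the first 16 layers, half afterwards.
kept = [PROMPT_TOKENS] * 16 + [PROMPT_TOKENS // 2] * 16
pruned = [PROMPT_TOKENS - k for k in kept]
lazy_kv = kv_cache_bytes(kept, HIDDEN)
lazy_aux = aux_cache_bytes(pruned, HIDDEN)

GIB = 2 ** 30
print(f"full KV cache:    {full / GIB:.2f} GiB")
print(f"pruned KV cache:  {lazy_kv / GIB:.2f} GiB")
print(f"aux cache:        {lazy_aux / GIB:.2f} GiB")
print(f"pruned + aux:     {(lazy_kv + lazy_aux) / GIB:.2f} GiB")
```

Under these assumptions, a hidden-state vector is half the size of a key-value pair at the same layer, so the combined footprint stays below the full KV cache. Whether that holds in practice depends on how the Aux Cache is actually laid out, which is the clarification the reviewer asks for.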
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: LLaMA · Pruning
