SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang

TL;DR
SlimInfer introduces a dynamic token pruning framework that accelerates long-context LLM inference by removing redundant tokens during processing, leveraging information diffusion to maintain semantic integrity and significantly reduce latency.
Contribution
It proposes a fine-grained, layer-wise token pruning method for long-context LLM inference that reduces latency while maintaining performance on LongBench.
Findings
Achieves up to 2.53x speedup in time-to-first-token
Reduces end-to-end latency by 1.88x on LLaMA3.1-8B-Instruct
Maintains performance on LongBench benchmarks
Abstract
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden…
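The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not SlimInfer's implementation: the abstract is truncated before it states how token importance is computed, so `score_fn`, the `schedule` of per-layer keep ratios, and the helper names `prune_tokens` and `forward_with_pruning` are all hypothetical placeholders. The sketch only shows the general shape of dropping low-scoring prompt tokens from the hidden states between layers so that later layers process a shorter sequence.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hidden = List[List[float]]  # one row of features per token


def prune_tokens(hidden: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 keep_ratio: float) -> Tuple[Hidden, List[int]]:
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(hidden) != len(scores):
        raise ValueError("need exactly one score per token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(hidden) * keep_ratio))
    # Rank positions by score (descending), then restore sequence order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = sorted(top)
    return [list(hidden[i]) for i in kept], kept


def forward_with_pruning(hidden: Hidden,
                         layers: Sequence[Callable[[Hidden], Hidden]],
                         score_fn: Callable[[Hidden], List[float]],
                         schedule: Dict[int, float]) -> Tuple[Hidden, List[int]]:
    """Run layers in order, pruning after the layers listed in `schedule`.

    `score_fn` and `schedule` are assumptions for illustration only; the
    source does not specify the actual importance criterion.
    """
    positions = list(range(len(hidden)))  # original index of each surviving token
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        ratio = schedule.get(idx)
        if ratio is not None:
            hidden, kept = prune_tokens(hidden, score_fn(hidden), ratio)
            positions = [positions[i] for i in kept]
    return hidden, positions


if __name__ == "__main__":
    identity = lambda h: h                           # stand-in for a transformer layer
    magnitude = lambda h: [abs(row[0]) for row in h]  # placeholder importance score
    tokens = [[1.0], [5.0], [2.0], [4.0]]
    out, pos = forward_with_pruning(tokens, [identity, identity], magnitude, {0: 0.5})
    print(out, pos)
```

Tracking `positions` alongside the hidden states matters in practice, because a real model would still need each surviving token's original index for positional encoding after pruning.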
Taxonomy
Topics: Speech Recognition and Synthesis
