SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Lingkun Long; Rubing Yang; Yushi Huang; Desheng Hui; Ao Zhou; Jianlei Yang

arXiv:2508.06447 · cs.CL · November 25, 2025

TL;DR

SlimInfer introduces a dynamic token pruning framework that accelerates long-context LLM inference by removing redundant tokens during processing, leveraging information diffusion to maintain semantic integrity and significantly reduce latency.

Contribution

SlimInfer proposes a fine-grained, layer-wise token pruning method for LLM inference that reduces latency while maintaining performance on LongBench.

Findings

01. Achieves up to 2.53x speedup in time-to-first-token

02. Reduces end-to-end latency by 1.88x on LLaMA3.1-8B-Instruct

03. Maintains performance on LongBench benchmarks

Abstract

Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden…
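The idea of dropping prompt tokens from the hidden states as they pass through the layers can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract above is truncated before it describes how SlimInfer scores tokens, so the dot-product score against the last token, the function names `prune_tokens` and `forward_with_pruning`, and the `keep_ratio` and `schedule` parameters are all hypothetical choices made for illustration only.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prune_tokens(
    hidden: List[Vector], keep_ratio: float, protect_last: int = 1
) -> Tuple[List[Vector], List[int]]:
    """Keep the highest-scoring tokens of one layer's hidden states.

    ASSUMPTION: a token's importance is its dot product with the last
    token's hidden state. The paper's real criterion is not given here.
    Returns the surviving hidden states and their indices, in order.
    """
    n = len(hidden)
    if n == 0:
        return [], []
    protect_last = max(0, min(protect_last, n))
    # Never keep fewer tokens than the protected tail.
    n_keep = min(n, max(protect_last, math.ceil(n * keep_ratio)))
    query = hidden[-1]
    protected = set(range(n - protect_last, n))
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: dot(hidden[i], query), reverse=True)
    chosen = protected | set(candidates[: n_keep - len(protected)])
    kept = sorted(chosen)  # preserve the original token order
    return [hidden[i] for i in kept], kept


def forward_with_pruning(
    hidden: List[Vector],
    layers: List[Callable[[List[Vector]], List[Vector]]],
    schedule: Dict[int, float],
) -> Tuple[List[Vector], List[int]]:
    """Run the layers in order, pruning after the layers listed in `schedule`.

    `schedule` maps a layer index to the keep ratio applied after that layer
    (hypothetical format). Later layers then process fewer hidden states.
    """
    positions = list(range(len(hidden)))
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx in schedule:
            hidden, kept = prune_tokens(hidden, schedule[idx])
            positions = [positions[i] for i in kept]
    return hidden, positions


if __name__ == "__main__":
    toy_hidden = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
    identity = lambda h: h  # stand-in for a transformer layer
    out, pos = forward_with_pruning(toy_hidden, [identity, identity], {0: 0.75, 1: 0.5})
    print(len(out), pos)
```

Because pruning happens on the hidden states themselves, every layer after a pruning point sees a shorter sequence, which is where a reduction in time-to-first-token would come from in a scheme of this kind. The returned `positions` list tracks which original prompt positions survive, which a real system would need for positional encodings and KV-cache bookkeeping.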

Taxonomy

Topics: Speech Recognition and Synthesis