Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference
Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang

TL;DR
This paper introduces Self-Predictive Token Skipping (SPTS), a training-free method that speeds up long-context LLM inference by probing the influence of target layers before selectively skipping tokens. It reports up to 2.46× faster prefilling and up to 2.29× faster end-to-end generation while maintaining accuracy.
Contribution
The paper proposes a training-free token skipping framework that combines probing strategies with multi-stage pruning. It targets three weaknesses of earlier token-oriented methods: insufficient structure optimization, outdated selection criteria, and redundancy interference.
Findings
Achieves up to 2.46× speedup in prefilling
Achieves up to 2.29× speedup in end-to-end generation
Maintains state-of-the-art accuracy
Abstract
Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference latency, yet still suffer from inherently insufficient structure optimization, outdated selection criteria, and redundancy interference, resulting in a suboptimal speed-accuracy trade-off. To address these issues, we propose a novel training-free framework, dubbed Self-Predictive Token Skipping (SPTS), for efficient long-context LLM inference. Specifically, motivated by probing the influence of target layers prior to skipping, we design two selective token skipping strategies for typical structures: Partial Attention Probing (PAP) for multi-head attention and Low-rank Transformation Probing (LTP) for feed-forward networks. The former selects…
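To make the general idea of token skipping concrete, here is a minimal sketch in plain Python. It is not the SPTS algorithm: the abstract is cut off before it describes how PAP and LTP score tokens, so the importance scores are simply passed in as an input. The names `select_tokens`, `skip_layer`, `layer_fn`, and `keep_ratio` are hypothetical, and the layer is simplified to a per-token residual update (closer to a feed-forward block than to attention).

```python
from typing import Callable, List, Sequence

Vector = List[float]


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def skip_layer(
    hidden: Sequence[Vector],
    scores: Sequence[float],
    layer_fn: Callable[[Vector], Vector],
    keep_ratio: float,
) -> List[Vector]:
    """Apply a residual layer only to the selected tokens.

    Assumption: `scores` are precomputed importance scores. How SPTS
    derives them by probing is not covered by this sketch.
    """
    kept = select_tokens(scores, keep_ratio)
    # Skipped tokens bypass the layer and keep their hidden state.
    out = [list(h) for h in hidden]
    for i in kept:
        update = layer_fn(hidden[i])
        out[i] = [h + u for h, u in zip(hidden[i], update)]
    return out


if __name__ == "__main__":
    hidden = [[1.0], [2.0], [3.0], [4.0]]
    scores = [0.1, 0.9, 0.2, 0.8]
    result = skip_layer(hidden, scores, lambda v: [10.0 for _ in v], 0.5)
    print(result)
```

With `keep_ratio=0.5`, only the two highest-scoring tokens pay the cost of `layer_fn`. The other two are carried forward unchanged, which is where the latency saving in a skipping scheme comes from.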
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques
