Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Zimeng Wu; Donghao Wang; Chaozhe Jin; Jiaxin Chen; Yunhong Wang

arXiv:2601.13155·cs.CL·February 3, 2026

Open Access

TL;DR

This paper introduces Self-Predictive Token Skipping (SPTS), a training-free method that makes long-context LLM inference more efficient by selectively skipping tokens using probing strategies. It reports speedups of up to 2.46× in prefilling and 2.29× in end-to-end generation while maintaining accuracy.

Contribution

The paper proposes a novel, training-free token skipping framework with probing strategies and multi-stage pruning. It targets three weaknesses of existing token-oriented methods: insufficient structure optimization, outdated selection criteria, and redundancy interference.

Findings

01 Achieves up to 2.46× speedup in prefilling

02 Achieves up to 2.29× speedup in end-to-end generation

03 Maintains state-of-the-art accuracy

Abstract

Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference latency, yet still suffer from inherently insufficient structure optimization, outdated selection criteria, and redundancy interference, resulting in a suboptimal speed-accuracy trade-off. To address these issues, we propose a novel training-free framework, dubbed Self-Predictive Token Skipping (SPTS), for efficient long-context LLM inference. Specifically, motivated by probing the influence of target layers prior to skipping, we design two selective token skipping strategies for typical structures: Partial Attention Probing (PAP) for multi-head attention and Low-rank Transformation Probing (LTP) for feed-forward networks. The former selects…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques