TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li

TL;DR
TokenSkip lets large language models skip less important tokens in their chain-of-thought outputs, reducing inference latency and token usage while maintaining reasoning accuracy.
Contribution
The paper introduces TokenSkip, an approach for controllable compression of chain-of-thought sequences in LLMs, based on an analysis of token importance.
Findings
Reduces reasoning tokens by 40% on GSM8K.
Maintains reasoning performance with less than 0.4% accuracy drop.
Effective across various models and tasks.
Abstract
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks…
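The idea of dropping low-importance tokens at a controllable ratio can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how importance is measured or how the model learns to skip tokens, so the function name `compress_cot`, the `keep_ratio` parameter, and the importance scores below are all hypothetical and serve only to show order-preserving pruning of an existing CoT.

```python
import math
from typing import List, Sequence


def compress_cot(tokens: Sequence[str],
                 scores: Sequence[float],
                 keep_ratio: float) -> List[str]:
    """Keep the highest-scoring fraction of tokens, preserving original order.

    `scores` are assumed per-token importance values (higher = more important);
    how they are obtained is outside the scope of this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_keep = math.ceil(len(tokens) * keep_ratio)
    # Rank positions by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original left-to-right order of the kept positions.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


# Hypothetical CoT fragment with made-up importance scores.
cot_tokens = ["the", "total", "is", "3", "+", "4", "=", "7"]
cot_scores = [0.1, 0.4, 0.2, 0.9, 0.8, 0.9, 0.7, 1.0]
compressed = compress_cot(cot_tokens, cot_scores, keep_ratio=0.5)
```

Lowering `keep_ratio` yields a shorter sequence, which is the sense in which the compression is controllable in this toy setting.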
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Innovative Microfluidic and Catalytic Techniques Innovation
