TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs
Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye

TL;DR
TokenSqueeze is a novel method that compresses reasoning paths in large language models, significantly reducing token usage while maintaining high accuracy, thus improving efficiency for reasoning tasks.
Contribution
It introduces an adaptive self-generated data approach and linguistic refinement to preserve reasoning performance during token compression.
Findings
Achieved 50% token reduction on MATH500 benchmark.
Maintained accuracy despite significant token reduction.
Demonstrated effectiveness across diverse reasoning tasks.
Abstract
Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Semantic Web and Ontologies · Natural Language Processing Techniques
