TL;DR
This paper introduces a step entropy-based compression method for Chain-of-Thought prompting in LLMs, reducing verbosity and inference costs while maintaining reasoning accuracy through a novel training strategy.
Contribution
It proposes a new step entropy metric to identify and prune redundant reasoning steps, and develops a training approach enabling LLMs to generate compressed reasoning chains.
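As a rough illustration of the metric, the sketch below computes step entropy as the sum of per-token Shannon entropies within a step, which is how the reviews describe the definition. The function names and the toy probability lists are assumptions made for this example; in practice the distributions would come from the model's next-token probabilities.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_entropy(token_distributions):
    """Sum of token entropies over all tokens generated in one step."""
    return sum(token_entropy(dist) for dist in token_distributions)

# Toy data (assumed): each step is a list of per-token distributions.
confident_step = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.01]]
uncertain_step = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5]]

low = step_entropy(confident_step)
high = step_entropy(uncertain_step)
```

Under this reading, a step whose tokens the model predicts with high confidence gets a low score and becomes a candidate for pruning.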
Findings
80% of low-entropy steps can be pruned with minimal accuracy loss
Pruning based on entropy outperforms random or high-entropy pruning
The method improves inference efficiency significantly
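A minimal sketch of the pruning finding, assuming it means ranking steps by entropy and dropping the lowest-scoring 80%. The placeholder string `[SKIP]` and the function name are hypothetical; the source only mentions that skip tokens are inserted, not what they look like.

```python
import math

def prune_low_entropy_steps(steps, entropies, ratio=0.8, skip_token="[SKIP]"):
    """Replace the lowest-entropy `ratio` fraction of steps with a placeholder.

    `skip_token` is a hypothetical marker, not the paper's actual token.
    """
    if len(steps) != len(entropies):
        raise ValueError("steps and entropies must have the same length")
    n_prune = math.floor(ratio * len(steps))
    # Indices sorted from lowest to highest entropy.
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    pruned = set(order[:n_prune])
    return [skip_token if i in pruned else s for i, s in enumerate(steps)]

steps = ["restate problem", "key substitution", "expand", "simplify", "recheck"]
entropies = [0.1, 2.0, 0.3, 0.2, 0.05]
compressed = prune_low_entropy_steps(steps, entropies)
```

Swapping the sort key for a random order or for descending entropy gives the two baselines that the findings say perform worse.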
Abstract
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building…
Peer Reviews
Decision·ICLR 2026 Poster
1. Using entropy as information proxy is clean. The paper defines step entropy by summing token entropies within a step and proves the conditional information contribution is bounded by this entropy, offering clear intuition for “skip low-entropy steps.” This is simple, transparent, and theoretically motivated. 2. Theorem 1 provides usable intuition. Bounding the information of any subset of steps by the sum of their entropies gives a direct justification for step-level pruning rather than toke
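The bound this review refers to can be related to two standard information-theoretic facts, written out below. The symbols (X for a step, Y for the final answer, Z for the preceding context, t_j for the tokens of a step) are notation assumed for this note, and the paper's exact statement of Theorem 1 may condition differently.

```latex
% Conditional mutual information is bounded by conditional entropy:
I(X; Y \mid Z) \;=\; H(X \mid Z) - H(X \mid Y, Z) \;\le\; H(X \mid Z)

% Chain rule over the tokens t_1, \dots, t_m of a step X:
H(X \mid Z) \;=\; \sum_{j=1}^{m} H\!\left(t_j \mid Z,\, t_{<j}\right)

% For a subset of steps X_1, \dots, X_k, subadditivity gives:
I(X_1, \dots, X_k; Y \mid Z) \;\le\; H(X_1, \dots, X_k \mid Z) \;\le\; \sum_{i=1}^{k} H(X_i \mid Z)
```

Read this way, a step with low entropy can carry only a small amount of information about the answer, which is the intuition behind skipping it.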
1. Step segmentation heuristic. Steps are segmented using simple newline heuristics. While this works reasonably, it can blur step boundaries or merge logically distinct thoughts. A sensitivity study with sentence-based or LLM-predicted segmentation would improve robustness. 2. Fixed 80% pruning threshold. The global 80% rule is justified empirically but may not generalize across datasets or reasoning styles. An adaptive or learned κ could better reflect per-problem difficulty. 3. Unnormalized
1. Clarity: The paper is generally written pretty clearly, and the method is not very complex which is nice. I would say that the writing is a little repetitive (saying the same thing many times), but not to the point where it makes the paper hard to understand. 2. Significance: It seems that the method, while simple, is reasonably effective. It greatly compresses the resulting CoT at a reasonable loss in accuracy.
1. To be honest, the method feels rather "hacky" to me, inserting skip tokens based on heuristics. My feeling is that the community in general is trying to move towards methods that perform end-to-end RL in a more principled way rather than these sorts of processes. 2. Relatedly, while this paper proposes methods to compress chains of thought, there are other methods to directly control the length of chains of thought such as L1 (Aggarwal and Welleck). These seem simpler, more elegant, and can a
1. The method is simple and easy to implement, requiring only entropy calculation and pruning based on low-information steps to construct the pruned CoT. It achieves strong performance, such as maintaining accuracy on Math500 even with a 30% compression ratio, demonstrating both effectiveness and efficiency. 2. The introduction of the SFT+RL framework makes the approach more practical. By allowing the model to learn when to skip redundant steps automatically, it extends the static compression me
1. The segmentation and granularity of reasoning steps are not rigorously defined. The approach relies on manually designed delimiters like \n\n, which may not generalize well across datasets or model architectures. 2. The definition of step entropy as the sum rather than the average of token entropies could bias the metric toward longer steps, potentially misrepresenting their true informativeness. 3. The presentation of Table 1 is not so good. A clearer organization would make the results easi
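The two concerns above, delimiter-based segmentation and sum versus average, can be illustrated with a small sketch. The helper names and the toy entropy values are assumptions made for this example; only the `\n\n` delimiter comes from the review.

```python
def split_steps(cot_text, delimiter="\n\n"):
    """Split a reasoning chain into steps on a fixed delimiter."""
    return [s.strip() for s in cot_text.split(delimiter) if s.strip()]

def sum_and_mean(token_entropies):
    """Return (summed, averaged) entropy for one step's tokens."""
    total = sum(token_entropies)
    return total, total / len(token_entropies)

# Toy values (assumed): a long, confident step and a short, uncertain one.
long_step = [0.1] * 30
short_step = [0.9, 1.1]

long_sum, long_mean = sum_and_mean(long_step)
short_sum, short_mean = sum_and_mean(short_step)
```

In this toy case the summed score ranks the long step higher while the averaged score ranks the short step higher, which is the length bias the reviewer describes.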
