Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
Xu Guo

TL;DR
This paper investigates how the utility of intermediate reasoning steps in large language models can be measured using conditional entropy, revealing that decreasing entropy correlates with correct answers and enabling early stopping of unproductive reasoning.
Contribution
The study introduces a method to quantify reasoning utility in LLMs via conditional entropy, providing insights into when to halt reasoning to improve accuracy.
Findings
Decreasing conditional entropy over reasoning steps correlates with correct answers.
Incorrect reasoning paths tend to be longer than correct ones.
Conditional entropy can predict the usefulness of reasoning steps for final accuracy.
| | | | Prealgebra | Algebra | | Precalculus | Geometry |
|---|---|---|---|---|---|---|---|
| No. of problems | 469 | 540 | 864 | 1185 | 903 | 546 | 479 |
| | 466 | 472 | 357 | 370 | 660 | 780 | 726 |
| | 5 | 5 | 5 | 4 | 7 | 7 | 8 |
| | 1728 | 1565 | 1281 | 1306 | 1860 | 1930 | 1895 |
| | 9 | 10 | 9 | 10 | 11 | 10 | 10 |
| LLM Accuracy | 0.81 | 0.84 | 0.92 | 0.95 | 0.63 | 0.63 | 0.67 |
Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
Xu Guo
KTH Royal Institute of Technology, Sweden
Abstract
Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer’s correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision.
We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model’s uncertainty on the answer span at each reasoning step using conditional entropy, with the context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
1 Introduction
Recent progress in large language models (LLMs) has been driven largely by scaling model size, yielding unprecedented performance on benchmarks Brown et al. (2020); OpenAI et al. (2024b); Yang et al. (2025). However, scaling benefits show diminishing returns, and further gains from increasing parameters are costly. Meanwhile, such scaling has revealed emergent abilities that can push performance limits without additional training Sun et al. (2024). A notable finding is Chain-of-Thought (CoT) prompting Wei et al. (2022), which prompts the model to always generate intermediate thoughts before providing the answer. The use of CoT has been widely proven effective, inspiring research into advanced reasoning approaches Yao et al. (2023); Xu et al. (2025a, b), agentic frameworks Guo et al. (2024), as well as models that are specially trained to think before answering OpenAI et al. (2024a); DeepSeek-AI et al. (2025).
Yet, unrestricted generation of longer reasoning does not guarantee proportional accuracy gains. Techniques such as self-consistency (Wang et al., 2023) and best-of-N (Lightman et al., 2024) sampling exhibit diminishing returns once the number of sampled chains exceeds a certain threshold (e.g., >40), making them computationally inefficient. If the utility of reasoning for the final answer could be estimated during generation, unproductive reasoning could be halted early, reducing both distractions to the final answer and computational cost.
Prior work on measuring reasoning utility is scarce. Existing approaches fall into two categories: model-based, which rely on a verifier or pretrained Process Reward Model Cobbe et al. (2021); Sun et al. (2024), and metric-based, which use statistical measures such as similarity or confidence. Similarity-based methods primarily target pruning redundancy, which does not directly link to accuracy Zhu et al. (2025); Wang et al. (2025); Xie et al. (2024), as we corroborate in Appendix A. Confidence-based metrics, such as perplexity and cross-entropy, are based on the probability the model assigns to the answer tokens, reflecting the model’s confidence in the answer Zhou et al. (2025); Taubenfeld et al. (2025). But our study in Appendix B finds limited evidence that they correlate with accuracy. In summary, there is a lack of systematic studies examining how the model’s uncertainty evolves as reasoning unfolds, and how this trajectory correlates with accuracy.
In this paper, we present an oracle study on the MATH Hendrycks et al. (2021) dataset across seven subjects. We use a reasoning LLM to generate CoT responses, segment them into steps, and compute conditional entropy on the answer span after each step. By grouping the responses into correct and incorrect sets, we plot the entropy trajectories and align them using cubic spline interpolation to reveal general trends across reasoning steps.
Our results on GPT-4o and Qwen2.5-32B reveal a strong correlation: when conditional entropy decreases over steps, the model is more likely to produce the correct answer; when it does not decrease, errors are more common. Also, incorrect reasoning paths are longer on average, indicating that more steps do not necessarily improve outcomes. These findings validate conditional entropy as a promising metric for reasoning utility, providing a foundation for future work on efficient reasoning pipelines that detect and avoid unproductive reasoning early.
2 Preliminary
We consider an autoregressive language model $p_\theta$ with vocabulary $\mathcal{V}$. Given a natural language prompt $X$, the tokenizer first maps it into a sequence of token IDs $(x_1, \dots, x_n)$, where $x_i \in \{1, \dots, |\mathcal{V}|\}$. Each token ID is mapped to an embedding vector through the embedding table $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, producing the embedded input sequence
$$\mathbf{e}(X) = \big(E[x_1], \dots, E[x_n]\big) \in \mathbb{R}^{n \times d}.$$
In autoregressive generation, the LLM predicts the next token distribution conditioned on a context $C$. Let $C_t$ denote the context at step $t$, which initially consists of $X$ and grows as new tokens are generated. For the $t$-th token, the model outputs
$$p_\theta(w_t \mid C_t), \qquad w_t \in \mathcal{V},$$
where $C_t = (X, w_{<t})$ and $w_{<t}$ denotes all previously generated tokens. Generation stops when an end-of-sequence token is produced or a predefined maximum length is reached.
Output structure. In our setting, the model’s complete response sequence is denoted as
$$w_{1:T} = (Z, Y),$$
where $Z$ is the reasoning chain, consisting of intermediate steps, explanations, or derivations, and $Y$ is the final answer span. This explicit decomposition allows us to analyze how the reasoning prefix $Z$ affects the model’s certainty about the final answer $Y$.
3 Measuring the Utility of Reasoning
Given that both the reasoning chain $Z$ and the final answer $Y$ are conditioned on the same input $X$, the joint distribution factorizes as:
$$p_\theta(Z, Y \mid X) = p_\theta(Z \mid X)\, p_\theta(Y \mid X, Z).$$
Our key question is: “Given $X$ and a reasoning chain $Z$, how much additional information does $Z$ provide about the correct answer $Y$ beyond what is already contained in $X$?”
Information-theoretic definition: We formalize the utility of $Z$ via conditional mutual information:
$$I(Y; Z \mid X) = H(Y \mid X) - H(Y \mid X, Z),$$
where $H(\cdot)$ denotes entropy. A large reduction in conditional entropy indicates that the reasoning chain significantly narrows down the correct answer space, while $I(Y; Z \mid X) \approx 0$ means $Z$ adds little useful information beyond the question itself.
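As a toy numerical illustration of this definition (not taken from the paper's experiments), the sketch below fixes a single question and uses a small hypothetical joint table over two reasoning outcomes and three candidate answers. The table values and the helper names `entropy` and `conditional_entropy` are assumptions made purely for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def conditional_entropy(joint):
    """H(Y | Z) for a joint table with rows indexed by z and columns by y."""
    joint = np.asarray(joint, dtype=float)
    p_z = joint.sum(axis=1)
    return float(sum(pz * entropy(row / pz) for pz, row in zip(p_z, joint) if pz > 0))

# Hypothetical joint p(z, y | x) for one fixed question x (illustrative values only).
joint = np.array([
    [0.45, 0.03, 0.02],   # reasoning outcome z = 0 points strongly to answer 0
    [0.05, 0.25, 0.20],   # reasoning outcome z = 1 leaves answers 1 and 2 open
])

h_y = entropy(joint.sum(axis=0))          # entropy of the answer given the question only
h_y_given_z = conditional_entropy(joint)  # entropy of the answer given question and reasoning
info_gain = h_y - h_y_given_z             # conditional mutual information, never negative
print(round(h_y, 3), round(h_y_given_z, 3), round(info_gain, 3))
```

In this toy table the reasoning outcome removes part of the uncertainty about the answer, so the difference is positive; a table whose rows are proportional to each other would give a difference of zero.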
Practical estimation via conditional entropy: Directly computing $H(Y \mid X)$ and $H(Y \mid X, Z)$ is intractable for LLMs. We therefore estimate them using teacher forcing: for each token position $t$ in a target answer sequence $Y = (y_1, \dots, y_T)$ and for a given context $C_k$ at the $k$-th reasoning step, we feed the LLM $(C_k, y_{<t})$ to obtain
$$p_\theta(\cdot \mid C_k, y_{<t}).$$
The token-level entropy at position $t$ is:
$$H_t(C_k) = -\sum_{v \in \mathcal{V}} p_\theta(v \mid C_k, y_{<t}) \log p_\theta(v \mid C_k, y_{<t}).$$
Since the ground-truth $Y$ is a sequence, we define the sequence-level conditional entropy as the average token entropy:
$$\hat{H}(Y \mid C_k) = \frac{1}{T} \sum_{t=1}^{T} H_t(C_k).$$
This conditional entropy directly reflects the model’s average uncertainty about $Y$ given $C_k$. Notably, this measurement relies solely on the model’s own output probabilities and requires no additional reward model or external supervision.
Entropy trajectory during reasoning: To study how reasoning affects answer certainty, we decompose $Z$ into $K$ steps and compute $\hat{H}(Y \mid C_k)$ with $C_k = (X, Z_{1:k})$ for each $k \in \{0, 1, \dots, K\}$, where $Z_{1:k}$ denotes the first $k$ reasoning steps. This yields an entropy trajectory $\big(\hat{H}(Y \mid C_0), \dots, \hat{H}(Y \mid C_K)\big)$ that reflects how uncertainty changes as reasoning unfolds. The information gain at the $k$-th reasoning step is therefore estimated by $\hat{H}(Y \mid C_{k-1}) - \hat{H}(Y \mid C_k)$.
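The estimation described in this section can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: it assumes the teacher-forced logits for the answer span have already been obtained from an inspector model, and it substitutes synthetic logits so the example runs on its own. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def token_entropies(logits):
    """Entropy (nats) of the softmax distribution at each answer position.

    logits: array of shape (T, V), one row of vocabulary logits per answer token.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def sequence_entropy(logits):
    """Average token entropy over the answer span."""
    return float(token_entropies(logits).mean())

def entropy_trajectory(logits_per_step):
    """One sequence-level entropy per reasoning prefix (index 0 means no reasoning yet)."""
    return np.array([sequence_entropy(step_logits) for step_logits in logits_per_step])

# Synthetic stand-in for teacher-forced logits (assumption): 3 answer tokens, a toy
# vocabulary of 50 entries, and 5 contexts whose distributions sharpen step by step.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 50))
logits_per_step = [base * scale for scale in (0.0, 1.0, 2.0, 4.0, 8.0)]

traj = entropy_trajectory(logits_per_step)
gains = traj[:-1] - traj[1:]   # information gain per step: previous entropy minus current
print(np.round(traj, 3), np.round(gains, 3))
```

With real data, each element of `logits_per_step` would come from one forward pass of the inspector model over the question, the first few reasoning steps, and the gold answer tokens.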
4 Experiment
Dataset.
We use the MATH dataset Hendrycks et al. (2021), which spans seven math domains, across five difficulty levels. Each problem is provided with a human step-by-step solution. Summary statistics are shown in Table 1.
Oracle study setting.
For data generation and inspection, we employ different LLMs. Specifically, we use GPT-4o and Qwen2.5-32B to generate model-produced solutions for each problem and compare them with the human references. To measure uncertainty, we use Qwen3-8B as an inspector that computes answer-span conditional entropy under teacher forcing. We simulate autoregressive generation, and at each token position in the answer span, we use the last-layer last-token hidden state to obtain the logits for computing the entropy.
Data Processing.
Every reasoning chain can have a different number of steps, making direct averaging on the same problem set infeasible. For each domain, we align entropy trajectories by resampling each chain to a common target number of steps, set separately for human and LLM chains. We use cubic splines for resampling, switching to linear interpolation when a chain has too few steps for a cubic fit, to place all curves on a shared horizontal axis. This up-/down-sampling preserves the overall shape of each trajectory while giving all curves the same number of points. After alignment, we average entropy at each normalised step to obtain domain-level trends, as shown in Figure 1.
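The alignment step can be sketched with SciPy's `CubicSpline` and NumPy's `interp`. This is an illustrative sketch under stated assumptions: the minimum number of points for the cubic fit (`MIN_POINTS_FOR_CUBIC`), the target of 10 steps, and the example trajectories are hypothetical values, not the settings used in the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

MIN_POINTS_FOR_CUBIC = 4  # assumption: illustrative threshold, not taken from the paper

def resample_trajectory(values, target_steps):
    """Resample one entropy trajectory onto `target_steps` evenly spaced points in [0, 1]."""
    values = np.asarray(values, dtype=float)
    if values.size == 1:
        return np.full(target_steps, values[0])
    x_old = np.linspace(0.0, 1.0, values.size)
    x_new = np.linspace(0.0, 1.0, target_steps)
    if values.size < MIN_POINTS_FOR_CUBIC:
        return np.interp(x_new, x_old, values)   # linear fallback for short chains
    return CubicSpline(x_old, values)(x_new)

def domain_trend(trajectories, target_steps):
    """Average the aligned trajectories at each normalised step."""
    aligned = np.stack([resample_trajectory(t, target_steps) for t in trajectories])
    return aligned.mean(axis=0)

# Hypothetical chains of different lengths; target_steps=10 is an illustrative value.
chains = [[3.0, 2.4, 1.1, 0.6, 0.2], [2.8, 2.9], [3.1, 2.0, 1.5, 1.4, 0.9, 0.7, 0.3]]
trend = domain_trend(chains, target_steps=10)
print(np.round(trend, 3))
```

Mapping every chain onto the unit interval before interpolating keeps the first and last measurements of each chain fixed, so the averaged curve compares chains by relative progress rather than by absolute step index.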
5 Results Discussion and Opportunities
This oracle study presents the following insights. Reasoning paths that lead to correct answers consistently achieve information gain over steps. Across seven math categories, the answer-span entropy for correct LLM chains (solid blue) consistently decreases, falls faster, and remains below that of incorrect chains (solid red). In contrast, incorrect chains show no consistent decrease: on lower-accuracy categories (Fig.1(a)-1(d)) entropy is typically flat or even rises, while on higher-accuracy categories (Fig.1(e)-1(h)) it may decrease yet stays above the correct trajectories. These patterns suggest a simple selection heuristic: prune chains whose entropy does not decrease; if all decrease, rank by the magnitude of the negative slope and keep the top-$k$.
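The selection heuristic suggested above could be prototyped as follows. This is a simplified sketch, not an evaluated method: it assumes that "does not decrease" is operationalised as a non-negative least-squares slope, and the function names, the cutoff parameter `top_k`, and the example trajectories are hypothetical.

```python
import numpy as np

def entropy_slope(trajectory):
    """Least-squares slope of entropy against the step index."""
    y = np.asarray(trajectory, dtype=float)
    if y.size < 2:
        return 0.0
    return float(np.polyfit(np.arange(y.size), y, deg=1)[0])

def select_chains(trajectories, top_k):
    """Drop chains with non-negative slope, then keep the `top_k` steepest decreases.

    Returns indices into `trajectories`, steepest decrease first.
    """
    slopes = [entropy_slope(t) for t in trajectories]
    decreasing = [i for i, s in enumerate(slopes) if s < 0]
    decreasing.sort(key=lambda i: slopes[i])   # most negative slope first
    return decreasing[:top_k]

# Hypothetical entropy trajectories for four sampled chains of one problem.
candidates = [
    [3.0, 2.5, 1.8, 0.9],   # steady decrease
    [3.0, 3.1, 3.0, 3.2],   # flat or rising
    [3.0, 2.9, 2.7, 2.6],   # slow decrease
    [2.0, 2.4, 2.9],        # rising
]
print(select_chains(candidates, top_k=2))
```

A fitted slope is only one way to summarise a trajectory; the difference between the first and last entropy, or the sum of per-step gains, would be reasonable alternatives under the same heuristic.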
Reasoning paths that lead to wrong answers are typically longer. Across the same problem sets, LLM-generated chains (solid) are generally longer than human chains (dashed), with GPT-4o (Fig.1(a)) exhibiting longer chains than Qwen2.5-32B (Fig.1(b)). Incorrect chains (red) are significantly longer than correct ones (blue). These findings suggest that LLMs often fail on problems that require more steps, and that tackling hard problems requires more advanced reasoning strategies instead of simply scaling the number of steps.
Information gains diminish faster for human reasoning than for LLM reasoning. In human solutions, entropy drops sharply in the early steps and then plateaus; by contrast, correct LLM trajectories (solid blue) decrease entropy more gradually and uniformly across steps. This pattern suggests a strategic difference: humans quickly narrow the answer space and then unravel the intermediate steps, whereas LLMs show weaker early narrowing and progress step by step until the answer becomes clear, as illustrated by an example in Appendix C. This suggests that improving LLMs’ ability to narrow the answer space early, instead of planning step by step, could facilitate efficient reasoning.
Limitations
While our study supports that entropy trajectories provide a model-internal, training-free signal to assess reasoning utility, several limitations remain:
- Our analysis relies on access to ground-truth answers, which enables controlled comparison between human and model trajectories. In real deployment, however, ground truth is unavailable, so entropy-based signals must be interpreted without oracle guidance. This limits the direct applicability of our findings, although they still provide useful guidance for pruning parallel scaling strategies (e.g., self-consistency, best-of-$N$) where answer consensus is the main signal.
- Experiments are conducted on the MATH dataset. While math problems are a natural testbed for reasoning, evaluations on more datasets are needed to establish whether the observed entropy dynamics generalize to other domains, such as commonsense reasoning.
- We use Qwen3-8B as the inspector model. Evaluations on different architectures and model sizes are needed to confirm the generalizability of the entropy dynamics.
Appendix A Semantic similarity trajectories for Qwen2.5-32B
We study whether cosine similarity between the evolving context and the gold answer span separates correct from incorrect reasoning chains (Figure 2). At each reasoning step $k$, we form $C_k = (X, Z_{1:k})$ and compute mean-pooled token representations $\bar{\mathbf{e}}(C_k)$ for $C_k$ and $\bar{\mathbf{e}}(Y)$ for the answer span $Y$. The cosine similarity at step $k$ is $s_k = \cos\big(\bar{\mathbf{e}}(C_k),\, \bar{\mathbf{e}}(Y)\big)$.
We plot $s_k$ for correct and incorrect chains using the same cubic-spline interpolation to reveal general trends. We find that (1) similarity trajectories show limited separability between correct and incorrect chains, indicating weak predictive power for accuracy. (2) Nonetheless, $s_k$ generally increases with $k$: as reasoning unfolds, the context representation moves closer to the answer-span representation, even though the answer is a short numeric expression. This suggests that the semantics of the natural-language derivation for a math problem become progressively closer to the target number in the model’s representation space, although this alignment alone does not determine correctness.
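The similarity computation in this appendix can be sketched as below. Random vectors stand in for real token representations, so the printed numbers carry no meaning; the function names, the embedding dimension, and the token counts are all assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average token vectors of shape (num_tokens, dim) into one vector."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon guarding against zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def similarity_trajectory(context_embeddings_per_step, answer_embeddings):
    """Cosine similarity between each growing context and the answer span."""
    answer_vec = mean_pool(answer_embeddings)
    return [cosine(mean_pool(c), answer_vec) for c in context_embeddings_per_step]

# Random vectors stand in for real token representations (assumption for illustration).
rng = np.random.default_rng(0)
answer = rng.normal(size=(2, 16))                       # 2 answer tokens, dimension 16
question = rng.normal(size=(6, 16))                     # 6 question tokens
steps = [rng.normal(size=(4, 16)) for _ in range(3)]    # 3 reasoning steps, 4 tokens each

# Context at step k is the question followed by the first k reasoning steps.
contexts = [np.vstack([question] + steps[:k]) for k in range(len(steps) + 1)]
sims = similarity_trajectory(contexts, answer)
print(np.round(sims, 3))
```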
Appendix B Cross-entropy trajectories for Qwen2.5-32B
We also report the cross-entropy trajectory during reasoning (Figure 3). At each reasoning step $k$, we form $C_k = (X, Z_{1:k})$ and compute:
$$\mathrm{CE}(Y \mid C_k) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid C_k, y_{<t}).$$
The key difference between cross-entropy and entropy is that cross-entropy only aggregates $-\log p_\theta(y_t \mid C_k, y_{<t})$ for the ground-truth tokens $y_t$, while entropy averages self-information across all $v \in \mathcal{V}$. Other commonly used metrics are directly derived from cross-entropy: perplexity is its exponential, and confidence corresponds to $p_\theta(y_t \mid C_k, y_{<t})$ for the ground-truth token.
Figure 3 shows that cross-entropy tends to plateau as reasoning unfolds, both for humans and LLMs. Interestingly, we do not observe a consistent difference between correct and incorrect chains. Moreover, LLMs often display lower cross-entropy (i.e., higher confidence) on incorrect chains. This suggests a weak link between confidence and correctness: higher confidence does not reliably imply correctness, raising concerns about confidence-based reasoning evaluation.
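To make the distinction between the two metrics concrete, the toy sketch below evaluates both on three hand-written distributions over a four-token vocabulary. The distributions and the gold index are hypothetical; the point is only that entropy depends on the shape of the whole distribution, while cross-entropy, perplexity, and confidence depend on the probability of the ground-truth token alone.

```python
import numpy as np

def entropy(p):
    """Self-information averaged over the whole vocabulary distribution (nats)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def cross_entropy(p, gold_index):
    """Negative log-probability of the ground-truth token only (nats)."""
    return float(-np.log(p[gold_index]))

gold = 0  # index of the ground-truth token in a toy 4-token vocabulary (assumption)

confident_right = np.array([0.94, 0.02, 0.02, 0.02])
confident_wrong = np.array([0.02, 0.94, 0.02, 0.02])
uncertain = np.array([0.25, 0.25, 0.25, 0.25])

for name, p in [("confident_right", confident_right),
                ("confident_wrong", confident_wrong),
                ("uncertain", uncertain)]:
    ce = cross_entropy(p, gold)
    print(f"{name}: entropy={entropy(p):.3f} cross_entropy={ce:.3f} "
          f"perplexity={np.exp(ce):.3f} confidence={p[gold]:.2f}")
```

The two peaked distributions have identical entropy but very different cross-entropy, which is why the two kinds of trajectory need not agree.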
Appendix C Examples for human solution and LLM solution
Humans figure out the answer space early; LLMs take it step by step. Below, we illustrate the reasoning process on a Number Theory problem, showing the strategic difference between humans and LLMs. This difference is reflected in the entropy trajectories in Figure 1 and is discussed in Section 5.
Number Theory Problem
A positive multiple of 45 less than 1000 is randomly selected. What is the probability that it is a two-digit integer? Express your answer as a common fraction.
Below is the Human Solution:
The positive multiples of 45 are
$$45,\ 90,\ 135,\ \ldots,\ 990 = 1 \cdot 45,\ 2 \cdot 45,\ 3 \cdot 45,\ \ldots,\ 22 \cdot 45.$$
There are 22 multiples on this list. Every positive multiple of 45 less than 1000 is either a two-digit integer or a three-digit integer. Out of the $90$ two-digit integers, $45$ and $90$ are multiples of 45. Therefore, the probability that the selected multiple of 45 has two digits is $2/22 = \frac{1}{11}$.
Below is the solution generated by Qwen2.5-32B:
Let’s solve this math problem step by step. Write the answer in \boxed{}.
Problem: A positive multiple of 45 less than 1000 is randomly selected. What is the probability that it is a two-digit integer? Express your answer as a common fraction.
Solution: To determine the probability that a randomly selected positive multiple of 45 less than 1000 is a two-digit integer, we need to follow these steps:
- Identify the range of multiples of 45 less than 1000:
  - The smallest multiple of 45 is $45$.
  - The largest multiple of 45 less than 1000 can be found by dividing 999 by 45 and taking the integer part:
    $$\left\lfloor \frac{999}{45} \right\rfloor = 22$$
  - Therefore, the largest multiple of 45 less than 1000 is $45 \times 22 = 990$.
- Count the total number of multiples of 45 less than 1000:
  - These multiples form an arithmetic sequence: $45, 90, 135, \ldots, 990$.
  - The number of terms in this sequence is given by:
    $$\frac{990 - 45}{45} + 1 = 22$$
- Identify the two-digit multiples of 45:
  - The smallest two-digit multiple of 45 is $45$.
  - The largest two-digit multiple of 45 is $90$.
  - The two-digit multiples of 45 are $45$ and $90$.
- Count the number of two-digit multiples of 45:
  - There are 2 two-digit multiples of 45.
- Calculate the probability:
  - The probability is the ratio of the number of two-digit multiples to the total number of multiples of 45 less than 1000:
    $$\frac{2}{22} = \frac{1}{11}$$
Thus, the probability that a randomly selected positive multiple of 45 less than 1000 is a two-digit integer is $\frac{1}{11}$.
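The counts that both solutions rely on can be checked by direct enumeration. The snippet below is an independent check added for illustration and is not part of either solution.

```python
from fractions import Fraction

multiples = list(range(45, 1000, 45))            # positive multiples of 45 below 1000
two_digit = [m for m in multiples if 10 <= m <= 99]

probability = Fraction(len(two_digit), len(multiples))
print(len(multiples), two_digit, probability)    # prints: 22 [45, 90] 1/11
```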
References
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
- Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
- DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 181 others. 2025. DeepSeek-V3 technical report. Preprint, arXiv:2412.19437.
- Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. Preprint, arXiv:2402.01680.
- Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, and 244 others. 2024a. OpenAI o1 system card. Preprint, arXiv:2412.16720.
- OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and 262 others. 2024b. GPT-4 technical report. Preprint, arXiv:2303.08774.
