# Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

**Authors:** Xu Guo

arXiv: 2508.20395 · 2025-08-29

## TL;DR

This paper measures the utility of intermediate reasoning steps in large language models using conditional entropy over the answer span. Entropy that decreases across steps correlates with correct answers, which suggests that unproductive reasoning could be detected and stopped early.

## Contribution

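The study quantifies reasoning utility in LLMs as the conditional entropy of the answer span, tracked as reasoning steps are appended to the context. As an oracle study, it offers evidence about when further reasoning stops helping rather than a deployed stopping method.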

## Key findings

- Conditional entropy that decreases over reasoning steps is strongly associated with correct answers, whereas flat or increasing entropy often accompanies wrong answers.
- Incorrect reasoning paths tend to be longer than correct ones.
- Conditional entropy can predict the usefulness of reasoning steps for final accuracy.
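
The findings above suggest a simple stopping heuristic, sketched here purely for illustration. The paper reports an association between entropy trends and correctness; it does not specify this rule. The function names `entropy_slope` and `should_stop_early`, the `window` size, and the `tol` threshold are all assumptions introduced for the example.

```python
from typing import Sequence


def entropy_slope(entropies: Sequence[float]) -> float:
    """Least-squares slope of entropy against step index (0, 1, 2, ...)."""
    n = len(entropies)
    if n < 2:
        return 0.0  # a single point has no trend
    mean_x = (n - 1) / 2
    mean_y = sum(entropies) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(entropies))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def should_stop_early(
    entropies: Sequence[float], window: int = 3, tol: float = 0.0
) -> bool:
    """Hypothetical rule: flag a chain whose recent entropy is flat or rising.

    `window` and `tol` are illustrative knobs, not values from the paper.
    """
    if len(entropies) < window:
        return False  # not enough steps to judge a trend
    recent = list(entropies[-window:])
    # Stop when the slope over the recent window is not clearly negative.
    return entropy_slope(recent) >= -tol


if __name__ == "__main__":
    print(should_stop_early([2.0, 1.5, 1.0, 0.5]))  # steadily decreasing
    print(should_stop_early([2.0, 2.0, 2.1, 2.2]))  # flat, then rising
```

A windowed slope is only one way to operationalize "decreasing"; a per-step difference or a smoothed trend would be equally defensible, and any threshold would need to be tuned on held-out data.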

## Abstract

Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
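
One plausible reading of the entropy measurement described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that, for each context made of the question plus the first k reasoning steps, the scoring model yields a probability distribution over the vocabulary at every token position of the answer span Y, and that per-position entropies are averaged over the span. The averaging, the toy four-token vocabulary, and the names `token_entropy`, `span_entropy`, and `entropy_trajectory` are assumptions made for the example.

```python
import math
from typing import List, Sequence

Distribution = Sequence[float]  # probabilities over the vocabulary, summing to 1


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy in nats: the expected negative log-likelihood."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def span_entropy(span_probs: Sequence[Distribution]) -> float:
    """Mean entropy across the answer-span positions (averaging is an assumption)."""
    return sum(token_entropy(p) for p in span_probs) / len(span_probs)


def entropy_trajectory(per_step: Sequence[Sequence[Distribution]]) -> List[float]:
    """Entropy of the answer span after each additional reasoning step.

    per_step[k] holds the answer-span distributions obtained with the first
    k + 1 reasoning steps in the context.
    """
    return [span_entropy(span) for span in per_step]


if __name__ == "__main__":
    # Toy 4-token vocabulary and a 2-token answer span, scored after 3 steps.
    uniform = [0.25, 0.25, 0.25, 0.25]
    leaning = [0.70, 0.10, 0.10, 0.10]
    peaked = [0.97, 0.01, 0.01, 0.01]
    trajectory = entropy_trajectory(
        [[uniform, uniform], [leaning, leaning], [peaked, peaked]]
    )
    print([round(h, 3) for h in trajectory])
```

In practice the distributions would come from a softmax over the scoring model's logits at the answer positions. The toy values here only show the shape of the computation: the more peaked the distributions become as steps are added, the lower the entropy.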

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20395/full.md

## Figures

38 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20395/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/2508.20395/full.md

---
Source: https://tomesphere.com/paper/2508.20395