Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Shufan Wang; Xing Hu; Junkai Chen; Zhiyuan Pan; Xin Xia

arXiv:2511.02197·cs.SE·November 5, 2025

Open Access

TL;DR

This paper evaluates and improves the confidence estimation of large language models in code reasoning tasks, demonstrating that hybrid strategies significantly enhance confidence reliability and highlighting areas for future improvement.

Contribution

It provides a comprehensive empirical analysis of confidence reliability in LLMs for code reasoning and proposes effective strategies for enhancement.

Findings

01. DeepSeek-Reasoner outperforms other models in confidence metrics.

02. Hybrid strategies combining prompt reassessment and calibration improve confidence reliability.

03. Confidence in complex reasoning tasks remains an area for further improvement.

Abstract

With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE,…
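The abstract names two standard tools, expected calibration error (ECE) and Platt Scaling. The sketch below illustrates the textbook versions of both, not the paper's implementation: the bin count, the choice to fit Platt Scaling on the logit of the stated confidence, the plain gradient-descent fit, and the toy data are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def logit(p, eps=1e-6):
    """Map a probability to a real-valued score, clipping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    Simplification: Platt's original method also smooths the 0/1 targets
    and uses a second-order optimizer; both are omitted here.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy, made-up data: an overconfident model (assumption, not from the paper).
confidences = [0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8]
correct = [1, 1, 1, 0, 1, 0, 0, 0]

ece_before = expected_calibration_error(confidences, correct)  # 0.35 for this toy data

scores = [logit(c) for c in confidences]
a, b = fit_platt(scores, correct)
calibrated = [sigmoid(a * s + b) for s in scores]
ece_after = expected_calibration_error(calibrated, correct)

print(f"ECE before: {ece_before:.3f}, after Platt scaling: {ece_after:.3f}")
```

In this sketch the calibrator is fitted and evaluated on the same eight points, which flatters the result. A realistic evaluation would fit `a` and `b` on a held-out calibration split and report ECE on separate test data.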

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques