Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia

TL;DR
This paper evaluates and improves the confidence estimation of large language models in code reasoning tasks, demonstrating that hybrid strategies significantly enhance confidence reliability and highlighting areas for future improvement.
Contribution
It provides a comprehensive empirical analysis of confidence reliability in LLMs for code reasoning and proposes effective strategies for enhancement.
Findings
DeepSeek-Reasoner outperforms the other evaluated models on confidence metrics such as ECE.
Hybrid strategies combining prompt reassessment and calibration improve confidence reliability.
Confidence in complex reasoning tasks remains an area for further improvement.
Abstract
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to , , and in terms of ECE,…
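The abstract names two standard tools, expected calibration error (ECE) and Platt Scaling, without showing how they are computed. The following is a minimal pure-Python sketch of both, not the authors' implementation: the function names (`expected_calibration_error`, `fit_platt`, `apply_platt`), the 10-bin equal-width binning, the gradient-descent fitting loop, and the toy data are all assumptions made for illustration.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(confidences, correct, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log-loss.

    Assumption: plain gradient descent stands in for the solver a real
    library would use; the fitted model is the same logistic form.
    """
    a, b = 1.0, 0.0
    n = len(confidences)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(confidences, correct):
            p = _sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(confidences, a, b):
    """Map raw confidences to calibrated probabilities."""
    return [_sigmoid(a * s + b) for s in confidences]


# Toy data (hypothetical): a model that always reports 0.9 but is right half the time.
raw_conf = [0.9, 0.9, 0.9, 0.9]
is_correct = [1, 0, 1, 0]

ece_before = expected_calibration_error(raw_conf, is_correct)
a, b = fit_platt(raw_conf, is_correct)
ece_after = expected_calibration_error(apply_platt(raw_conf, a, b), is_correct)
```

In this toy case the raw confidences are overconfident, so the fitted mapping pulls them toward the observed accuracy and ECE drops. In practice the parameters would be fitted on a held-out calibration split rather than on the data being evaluated.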
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
