Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude

TL;DR
This paper introduces fine-grained confidence calibration methods for LLMs in automated code revision, improving the reliability of confidence scores for better decision-making.
Contribution
It proposes local Platt-scaling applied to fine-grained confidence scores, enhancing calibration accuracy over traditional global methods in code editing tasks.
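For readers unfamiliar with the technique, the following is a minimal sketch of plain Platt scaling, which fits a two-parameter logistic map from a raw confidence score to a probability of correctness. It is not the authors' implementation. The function names, the gradient-descent fitting loop, and the toy data are assumptions made for illustration, and this summary does not spell out how the paper's "local" variant partitions or groups the fine-grained scores.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.5, steps=20000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss.

    scores: raw confidence scores (hypothetical inputs, e.g. in [0, 1]).
    labels: 1 if the corresponding output was correct, else 0.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    # Map a raw score to a calibrated probability with the fitted parameters.
    return sigmoid(a * score + b)


# Toy held-out data (made up): the raw scores are overconfident.
raw_scores = [0.95, 0.9, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5]
correct = [1, 1, 0, 1, 0, 1, 0, 0]

a, b = fit_platt(raw_scores, correct)
calibrated = [calibrate(s, a, b) for s in raw_scores]
```

Under this sketch, a "global" calibrator would correspond to a single (a, b) pair fitted on all held-out scores. Fitting separate pairs on subsets of scores is one possible reading of "local", but that reading is an assumption here.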
Findings
Fine-grained scores achieve lower calibration error.
Calibration improves across multiple tasks and models.
Combining local and global calibration yields the best results.
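The findings above refer to "calibration error" without naming a metric. A common choice is expected calibration error (ECE), sketched below under the assumption of equal-width bins. Whether the paper uses this exact metric or binning is not stated in this summary, and the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities of correctness in [0, 1].
    labels: 1 if the output was actually correct, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A lower value means the stated confidences track observed correctness more closely. A model that reports 0.9 confidence on outputs that are right only half the time contributes a gap of roughly 0.4 for that bin.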
Abstract
In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To this end, a canonical mitigation method is to provide calibrated confidence scores that faithfully reflect their likelihood of correctness at the instance-level. Such information allows users to make immediate decisions regarding output acceptance, abstain from error-prone outputs, and better align their expectations with the model's capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for…
