Multicalibration for LLM-based Code Generation
Viola Campos, Robin Kuschnereit, Adrian Ulges

TL;DR
This paper explores multicalibration techniques to improve the confidence calibration of code-generating language models, leading to more accurate likelihood estimates and better code correctness predictions.
Contribution
It evaluates four multicalibration approaches for code LLMs on three function synthesis benchmarks, showing improvements over both uncalibrated token likelihoods and baseline calibration methods.
Findings
Multicalibration improves skill scores by +1.03 over uncalibrated likelihoods.
Multicalibration improves skill scores by +0.37 over baseline calibration methods.
The dataset with code generations, likelihoods, and correctness labels is made publicly available.
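The summary does not define "skill score". A minimal sketch, assuming it refers to the Brier skill score (the Brier score of the predicted confidences relative to a constant base-rate predictor), is given below. The function names are illustrative and not taken from the paper.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted confidences and 0/1 correctness labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def brier_skill_score(probs, labels):
    """Assumed definition: 1 - BS(model) / BS(reference).

    The reference predictor always outputs the base rate of correct generations.
    1.0 is a perfect score, 0.0 matches the reference, negative is worse than it.
    """
    base_rate = sum(labels) / len(labels)
    reference = brier_score([base_rate] * len(labels), labels)
    if reference == 0.0:
        # All labels identical: the reference is already perfect, the ratio is undefined.
        raise ValueError("skill score is undefined when all labels are identical")
    return 1.0 - brier_score(probs, labels) / reference


labels = [1, 0, 1, 0]                      # hypothetical correctness labels
overconfident = [0.0, 1.0, 0.0, 1.0]       # confidently wrong on every sample
print(brier_skill_score(overconfident, labels))
```

Under this assumed definition the score has no lower bound, so strongly overconfident likelihoods can fall well below zero, which would be consistent with improvements larger than 1.0.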
Abstract
As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs, that is, ensuring their confidence scores faithfully represent the true likelihood of code correctness. To do so, we investigate multicalibration, which can capture additional factors about a coding problem, such as complexity, code length, or the programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both uncalibrated token likelihoods (+1.03 in skill score) and baseline calibrations (+0.37 in skill score). We study the influence of the aforementioned factors in ablations, and make our dataset (consisting of code generations, likelihoods, and correctness labels) available for…
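To illustrate the idea of calibrating with respect to additional factors, the sketch below applies histogram binning separately per group (here, the programming language). This is a simple stand-in chosen for illustration, not necessarily one of the four approaches studied in the paper, and all names and data are hypothetical.

```python
from collections import defaultdict


def bin_index(p, n_bins):
    """Index of the equal-width confidence bin containing p (p = 1.0 goes in the last bin)."""
    return min(int(p * n_bins), n_bins - 1)


def fit_group_binning(probs, labels, groups, n_bins=5):
    """Map each (group, confidence bin) cell to its empirical correctness rate."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [number correct, number of samples]
    for p, y, g in zip(probs, labels, groups):
        cell = cells[(g, bin_index(p, n_bins))]
        cell[0] += y
        cell[1] += 1
    return {key: correct / count for key, (correct, count) in cells.items()}


def calibrate(table, p, group, n_bins=5):
    """Look up the calibrated confidence; fall back to the raw likelihood for unseen cells."""
    return table.get((group, bin_index(p, n_bins)), p)


# Hypothetical data: the model reports 0.9 everywhere, but accuracy differs by language.
probs = [0.9] * 8
labels = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["python"] * 4 + ["java"] * 4

table = fit_group_binning(probs, labels, groups)
print(calibrate(table, 0.9, "python"), calibrate(table, 0.9, "java"))
```

A single global calibrator would map the raw 0.9 to 0.5 for both languages in this toy data; conditioning on the group recovers the differing correctness rates. In practice the cells would be fit on held-out data and would need enough samples per cell to be reliable.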
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Software Testing and Debugging Techniques
