Multicalibration for LLM-based Code Generation

Viola Campos; Robin Kuschnereit; Adrian Ulges

arXiv:2512.08810·cs.SE·December 10, 2025



TL;DR

This paper explores multicalibration techniques to improve the confidence calibration of code-generating language models, leading to more accurate likelihood estimates and better code correctness predictions.

Contribution

It introduces and evaluates multicalibration methods for code LLMs, demonstrating their effectiveness over existing calibration approaches on multiple benchmarks.

Findings

01

Multicalibration improves skill scores by +1.03 over uncalibrated likelihoods.

02

Multicalibration improves skill scores by +0.37 over baseline calibration methods.

03

The dataset with code generations, likelihoods, and correctness labels is made publicly available.

Abstract

As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs - ensuring their confidence scores faithfully represent the true likelihood of code correctness. To do so, we investigate multicalibration, which can capture additional factors about a coding problem, such as complexity, code length, or programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both uncalibrated token likelihoods (+1.03 in skill score) and baseline calibrations (+0.37 in skill score). We study the influence of the aforementioned factors in ablations, and make our dataset (consisting of code generations, likelihoods, and correctness labels) available for…
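To make the idea concrete, the following is a minimal sketch of one generic multicalibration scheme: iteratively patching predictions per (group, confidence bin) cell. It is not the paper's method; the abstract does not say which four approaches are studied. The sketch also assumes that "skill score" means the Brier skill score against a base-rate reference, which this page does not confirm. All function names, the toy data, and the grouping by programming language are hypothetical.

```python
from typing import Dict, List, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between confidence and 0/1 correctness."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def brier_skill_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """ASSUMPTION: skill = 1 - BS / BS_ref, where the reference
    always predicts the overall fraction of correct generations."""
    base_rate = sum(labels) / len(labels)
    reference = brier_score([base_rate] * len(labels), labels)
    return 1.0 - brier_score(probs, labels) / reference


def multicalibrate(
    probs: Sequence[float],
    labels: Sequence[int],
    groups: Dict[str, List[int]],  # group name -> sample indices (may overlap)
    n_bins: int = 10,
    tol: float = 0.01,
    max_rounds: int = 100,
) -> List[float]:
    """Shift predictions until, in every (group, bin) cell, the mean
    prediction matches the mean label to within `tol`."""
    p = list(probs)
    for _ in range(max_rounds):
        changed = False
        for members in groups.values():
            # Bucket this group's samples by their current confidence.
            cells: Dict[int, List[int]] = {}
            for i in members:
                b = min(int(p[i] * n_bins), n_bins - 1)
                cells.setdefault(b, []).append(i)
            for idx in cells.values():
                mean_y = sum(labels[i] for i in idx) / len(idx)
                mean_p = sum(p[i] for i in idx) / len(idx)
                gap = mean_y - mean_p
                if abs(gap) > tol:
                    # Patch the whole cell, clipping to valid probabilities.
                    for i in idx:
                        p[i] = min(1.0, max(0.0, p[i] + gap))
                    changed = True
        if not changed:
            break
    return p


# Toy data (hypothetical): the model reports 0.9 confidence everywhere,
# but is right 90% of the time on one language and 50% on the other.
raw = [0.9] * 20
labels = [1] * 9 + [0] + [1] * 5 + [0] * 5
groups = {
    "python": list(range(0, 10)),
    "java": list(range(10, 20)),
    "all": list(range(20)),
}
calibrated = multicalibrate(raw, labels, groups)
print(brier_skill_score(raw, labels), brier_skill_score(calibrated, labels))
```

In this toy setup a single global calibration map could not separate the two languages, because every sample carries the same raw confidence; conditioning on the group is what allows the correction. A real evaluation would fit the patches on held-out data rather than on the samples being scored.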


Code & Models

Datasets

lavis-nlp/CALIBRI
dataset · 306 downloads


Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Software Testing and Debugging Techniques