Does In-IDE Calibration of Large Language Models work at Scale?
Roham Koohestani, Agnia Sergeyuk, David Gros, Claudio Spiess, Sergey Titov, Prem Devanbu, Maliheh Izadi

TL;DR
This study evaluates the effectiveness of in-IDE calibration of large language models for code generation, finding limited success with general calibration methods but highlighting user preferences for non-numerical reliability signals.
Contribution
The paper introduces a scalable calibration framework for code models and assesses its effectiveness at scale, along with human-centered design principles for conveying reliability signals.
Findings
General post-hoc calibration does not significantly improve confidence alignment.
The effectiveness of personalized calibration depends on the volume of a user's interactions.
Developers prefer non-numerical, color-coded reliability indicators.
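To make "confidence alignment" in the findings above concrete, here is a minimal sketch of one common way to measure it (expected calibration error) and one common post-hoc method (Platt scaling). This is not the paper's framework. The function names, the binary `accepted` label standing in for an acceptability measure, and the toy data are all assumptions made for illustration.

```python
import math


def expected_calibration_error(confidences, accepted, n_bins=10):
    """Weighted mean gap between average confidence and acceptance rate per bin.

    `accepted` holds 0/1 labels; treating acceptability as binary is an
    assumption of this sketch, not something taken from the paper.
    """
    if len(confidences) != len(accepted) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, a in zip(confidences, accepted):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, a))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc_rate = sum(a for _, a in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc_rate)
    return ece


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the log finite
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(confidences, accepted, lr=0.1, steps=2000):
    """Fit sigmoid(a * logit(p) + b) to the labels by gradient descent on log loss."""
    xs = [_logit(c) for c in confidences]
    n = len(xs)
    a, b = 1.0, 0.0  # start from the identity mapping
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, accepted):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(confidences, a, b):
    """Map raw confidences through the fitted Platt scaling parameters."""
    return [_sigmoid(a * _logit(c) + b) for c in confidences]


# Toy data (hypothetical): the model reports 0.9 confidence on every
# suggestion, but only half of the suggestions are accepted.
raw = [0.9] * 10
labels = [1, 0] * 5
a, b = fit_platt(raw, labels)
calibrated = apply_platt(raw, a, b)
print("ECE before:", expected_calibration_error(raw, labels))
print("ECE after: ", expected_calibration_error(calibrated, labels))
```

The sketch fits and evaluates on the same toy data purely to show the mechanics. A realistic evaluation would fit the calibrator on one set of interactions and measure alignment on held-out ones.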
Abstract
The introduction of large language models into integrated development environments (IDEs) is revolutionizing software engineering, yet it poses challenges to the usefulness and reliability of Artificial Intelligence-generated code. Post-hoc calibration of internal model confidences aims to align probabilities with an acceptability measure. Prior work suggests calibration can improve alignment, but at-scale evidence is limited. In this work, we investigate the feasibility of applying calibration of code models to an in-IDE context. We study two aspects of the problem: (1) the technical method for implementing confidence calibration and improving the reliability of code generation models, and (2) the human-centered design principles for effectively communicating reliability signals to developers. First, we develop a scalable and flexible calibration framework which can be used to obtain…
