The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas

TL;DR
This paper evaluates the confidence of large language models in code completion by analyzing perplexity across languages, models, and datasets, revealing language- and model-dependent variations.
Contribution
It introduces a systematic analysis of LLM confidence in code generation using perplexity metrics across multiple languages and models, providing practical insights for developers.
Findings
Strongly typed languages have lower perplexity than dynamically typed languages.
Scripting languages show higher perplexity, indicating lower model confidence.
Perplexity varies with the choice of LLM and language, affecting code completion reliability.
Abstract
Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence…
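To make the perplexity metric concrete, here is a minimal sketch assuming the standard definition: the exponential of the mean negative log-likelihood per token. It is not the authors' evaluation pipeline. The function name `perplexity` and the per-token probabilities are invented for illustration; in practice the log-probabilities would come from a code LLM scoring a completion.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model might assign to each token of a completion.
confident_completion = [math.log(p) for p in (0.90, 0.80, 0.95)]
uncertain_completion = [math.log(p) for p in (0.20, 0.10, 0.30)]

# Lower perplexity corresponds to higher model confidence.
print(perplexity(confident_completion) < perplexity(uncertain_completion))
```

A useful sanity check on the definition: if a model assigns every token probability 1/k, the perplexity is exactly k, so perplexity can be read as the effective number of equally likely choices the model faces at each token.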
