Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
Weilin He, Arindam Sharma, Cristina David

TL;DR
This paper introduces a semantic distance-aware uncertainty estimation method for LLM-based code generation, improving correctness prediction and efficiency without needing model internals or LLM judges.
Contribution
The paper proposes a novel semantic distance-based uncertainty metric that outperforms existing sample-based baselines across multiple models, languages, and settings, while reducing runtime by approximately 48-79%.
Findings
The proposed metrics correlate strongly with correctness across LiveCodeBench, MBPP, HumanEval-X and BigCodeBench.
They outperform state-of-the-art sample-based baselines across multiple models and three languages (Python, Java and C++).
Reduces runtime by approximately 48-79% compared to existing methods.
Abstract
LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing estimators make different design choices about how behaviours are identified, aggregated, referenced and compared, making them difficult to assess. We therefore first introduce a taxonomy that disentangles these choices and reveals a missing design point: semantic distance-aware uncertainty estimation, which measures not only whether sampled programs disagree, but how severely their execution behaviours differ. Across LiveCodeBench, MBPP, HumanEval-X and BigCodeBench, spanning Python, Java and C++, our metrics provide strong proxies for correctness, and consistently outperform state-of-the-art sample-based baselines across both closed-source models…
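To make the distinction in the abstract concrete, the following is a minimal sketch of the general idea, not the paper's actual metric, which this summary does not specify. It assumes candidates are plain Python callables, that behaviour is observed on a fixed set of probe inputs, and that the distance between two candidates is the fraction of probe inputs on which their outputs differ. All function names here are hypothetical.

```python
from itertools import combinations
from typing import Any, Callable, Sequence, Tuple

Trace = Tuple[Tuple[str, Any], ...]


def behaviour(program: Callable[[Any], Any], inputs: Sequence[Any]) -> Trace:
    """Run one candidate on every probe input; record its output or exception type."""
    trace = []
    for x in inputs:
        try:
            trace.append(("ok", program(x)))
        except Exception as exc:  # a crash is also observable behaviour
            trace.append(("error", type(exc).__name__))
    return tuple(trace)


def pairwise_distance(a: Trace, b: Trace) -> float:
    """Assumed distance: fraction of probe inputs on which two traces differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def binary_disagreement(traces: Sequence[Trace]) -> float:
    """Share of candidate pairs that differ at all (ignores how much they differ)."""
    pairs = list(combinations(traces, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_aware_uncertainty(traces: Sequence[Trace]) -> float:
    """Mean pairwise distance (sensitive to how severely behaviours differ)."""
    pairs = list(combinations(traces, 2))
    return sum(pairwise_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical candidates for the task "return the absolute value of x".
probe_inputs = [-2, 0, 1, 7]
correct_a = lambda x: abs(x)
correct_b = lambda x: x if x >= 0 else -x
slightly_off = lambda x: 0 if x == 7 else abs(x)  # wrong on one probe input
badly_off = lambda x: x + 1                       # wrong on every probe input

near_traces = [behaviour(p, probe_inputs) for p in (correct_a, correct_b, slightly_off)]
far_traces = [behaviour(p, probe_inputs) for p in (correct_a, correct_b, badly_off)]

near_binary = binary_disagreement(near_traces)
far_binary = binary_disagreement(far_traces)
near_distance = distance_aware_uncertainty(near_traces)
far_distance = distance_aware_uncertainty(far_traces)
```

In this toy setup both sample sets contain one outlier, so the binary measure scores them identically, while the distance-aware measure is higher for the set whose outlier diverges on every probe input. A real estimator would also need sandboxed execution and a principled choice of inputs and distance function, which this sketch leaves out.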
