Ensuring Functional Correctness of Large Code Models with Selective Generation

Jaewoo Jeong; Taesoo Kim; Sangdon Park

arXiv:2505.13553·cs.SE·October 27, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a method to improve the correctness of large code generation models by automatically generating unit tests for evaluating and controlling code quality, reducing hallucinations and increasing safety.

Contribution

It proposes a selective code generation approach that abstains from uncertain outputs based on functional correctness, and introduces FuzzEval, a paradigm using generated unit tests for evaluation and learning.
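The idea of judging functional correctness with generated tests can be illustrated with a minimal sketch. It assumes a reference implementation is available and uses plain random sampling as a stand-in for the dynamic code analysis tools the paper relies on. All names here (`fuzz_agreement`, `gen_input`, `reference_abs`, `buggy_abs`, `int_input`) and the default of 100 tests are hypothetical and not taken from the paper.

```python
import random
from typing import Callable, Tuple


def fuzz_agreement(
    candidate: Callable[..., object],
    reference: Callable[..., object],
    gen_input: Callable[[random.Random], Tuple],
    n_tests: int = 100,  # arbitrary illustrative budget
    seed: int = 0,
) -> float:
    """Return the fraction of random inputs on which candidate matches reference."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_tests):
        args = gen_input(rng)
        expected = reference(*args)
        try:
            ok = bool(candidate(*args) == expected)
        except Exception:
            # A crash on a generated input counts as a failed test.
            ok = False
        passed += int(ok)
    return passed / n_tests


# Hypothetical toy task: absolute value of an integer.
def reference_abs(x: int) -> int:
    return abs(x)


def buggy_abs(x: int) -> int:
    return x  # wrong for negative inputs


def int_input(rng: random.Random) -> Tuple[int]:
    return (rng.randint(-100, 100),)


correct_rate = fuzz_agreement(reference_abs, reference_abs, int_input)
buggy_rate = fuzz_agreement(buggy_abs, reference_abs, int_input)
```

A pass rate below some chosen level would mark the candidate as functionally incorrect. How the paper sets that level and generates inputs is not specified in this summary, so this sketch should not be read as the authors' procedure.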

Findings

01 · Effective control of code hallucination demonstrated

02 · Generated unit tests improve evaluation precision

03 · Selective generation increases safety and efficiency

Abstract

The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the executable nature of code. Accordingly, we propose a selective code generator that abstains from uncertain generations (based on the functional correctness evaluated by generated unit tests) to theoretically control the correctness among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm FuzzEval. We demonstrate the efficacy of our…
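The abstention mechanism described in the abstract can be sketched as threshold selection on a confidence score. This is an illustrative simplification: it picks the lowest threshold whose empirical false discovery rate on a labeled calibration set is at most a target `epsilon`, and it carries no statistical guarantee, unlike the theoretical control the abstract claims. The function names, the toy scores, and the correctness labels are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def empirical_fdr(
    scores: Sequence[float], correct: Sequence[bool], tau: float
) -> Tuple[float, int]:
    """Return (fraction of incorrect items among those with score >= tau, count selected)."""
    selected = [c for s, c in zip(scores, correct) if s >= tau]
    if not selected:
        return 0.0, 0
    n_wrong = sum(1 for c in selected if not c)
    return n_wrong / len(selected), len(selected)


def pick_threshold(
    scores: Sequence[float], correct: Sequence[bool], epsilon: float
) -> Optional[float]:
    """Return the lowest observed score usable as a threshold with empirical FDR <= epsilon."""
    for tau in sorted(set(scores)):
        fdr, n_selected = empirical_fdr(scores, correct, tau)
        if n_selected > 0 and fdr <= epsilon:
            return tau
    return None  # no threshold meets the target on this calibration set


def selective_generate(code: str, score: float, tau: float) -> Optional[str]:
    """Return the generated code if its score clears tau, otherwise abstain (None)."""
    return code if score >= tau else None


# Hypothetical calibration data: model confidence and test-based correctness labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
correct = [True, True, True, False, True, False]

tau = pick_threshold(scores, correct, epsilon=0.25)
kept = selective_generate("def f(x): return x", 0.5, tau)
abstained = selective_generate("def f(x): return x", 0.3, tau)
```

On this toy data, accepting everything gives an empirical rate of 2/6, which exceeds 0.25, while the threshold 0.40 gives 1/5, so 0.40 is selected. A guaranteed version would replace the empirical rate with an upper confidence bound; the exact bound used in the paper is not given in this summary.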

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. Its originality lies in defining α-code entailment (statistical code functional correctness via fuzzing) and extending selective prediction to code with SCG, bridging gaps between natural language entailment and code’s structural complexity.
2. It demonstrates high quality through rigorous theoretical proofs (e.g., FDR-CE controllability guarantees) and well-controlled experiments (5 LCMs, 4 datasets, 50 random splits) that report statistical significance.
3. It maintains clarity by explainin

Weaknesses

Admittedly, I am not an expert in this field, so I raise these weaknesses with low confidence.
1. It relies on an i.i.d. assumption for FDR-CE guarantees but does not explore mitigations for distribution shift (e.g., unseen code), limiting real-world applicability.
2. Low-quality models (e.g., CodeLlama 13B) fail to meet the desired FDR-CE due to uncalibrated scoring functions, yet the paper does not propose strategies to improve SCG for such models.
3. It only briefly compares FuzzEval with L

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This paper makes theoretical and methodological contributions to addressing code hallucination in large language models. The introduction of α-code entailment represents a novel and necessary formalization for measuring functional correctness between code snippets, addressing the challenge that code's unnatural structure makes it difficult for humans to verify functional equivalence at scale. Additionally, fuzzing tools, traditionally used for bug detection, are repurposed to automatically gener

Weaknesses

1. The computational cost of fuzzing remains unanalyzed. While nₘₐₓ = 150 is specified, the paper provides no wall-clock time measurements, overhead comparisons relative to code generation time, or scalability analysis for larger codebases. The low selection efficiency for weaker models (CodeLlama achieves only ~7% efficiency in Figure 4a) suggests the method may be impractical below certain model quality thresholds.
2. The evaluation scope is limited to simple algorithmic problems in standalon

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is well-written and analyzes the interesting and important problem of code hallucination in LLMs.
2. It proposes a novel approach (SCG and FuzzEval) for automated unit test generation to evaluate code correctness.
3. The empirical validation is thorough, covering multiple datasets, models, and baselines.

Weaknesses

Major:
1. The framework's core reliance on a ground-truth "canonical solution" for α-equivalence is a significant limitation, as such reference solutions are unavailable in most real-world code generation scenarios.
2. The accuracy of the proposed FDR-CE metric is highly dependent on the quality of the generated unit tests, but the impact of this quality (e.g., test coverage) is not quantified.
3. The evaluation benchmarks consist mainly of simple, stateless algorithmic problems, mak

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Formal Methods in Verification · Logic, programming, and type systems

Methods: Softmax · Attention Is All You Need