Ensuring Functional Correctness of Large Code Models with Selective Generation
Jaewoo Jeong, Taesoo Kim, Sangdon Park

TL;DR
This paper improves the functional correctness of large code generation models by automatically generating unit tests, then using those tests both to evaluate generated code and to decide when the model should abstain, which reduces hallucinated outputs.
Contribution
It proposes a selective code generation approach that abstains from uncertain outputs based on functional correctness, and introduces FuzzEval, a paradigm using generated unit tests for evaluation and learning.
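The idea of judging generated code by automatically produced unit tests can be illustrated with a minimal sketch. It assumes a trusted reference solution is available and uses plain random input sampling as a stand-in for a real fuzzing tool; the helper names (`generate_unit_tests`, `pass_rate`, `int_list_gen`) and the test count are hypothetical and not taken from the paper.

```python
import random


def generate_unit_tests(reference, input_gen, n_tests, seed=0):
    """Build (args, expected_output) pairs by running a reference solution
    on randomly sampled inputs. Random sampling stands in for a fuzzer."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_tests):
        args = input_gen(rng)
        tests.append((args, reference(*args)))
    return tests


def pass_rate(candidate, tests):
    """Fraction of generated tests on which the candidate matches the
    reference output. A crash counts as a failed test."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)


# Hypothetical task: return the largest element of a non-empty list.
def reference_max(xs):
    return max(xs)


def good_max(xs):
    return sorted(xs)[-1]


def buggy_max(xs):
    return xs[0]  # only correct when the first element is the largest


def int_list_gen(rng):
    """Sample one argument tuple: a non-empty list of small integers."""
    n = rng.randint(1, 8)
    return ([rng.randint(-50, 50) for _ in range(n)],)


tests = generate_unit_tests(reference_max, int_list_gen, n_tests=100)
print("good candidate pass rate:", pass_rate(good_max, tests))
print("buggy candidate pass rate:", pass_rate(buggy_max, tests))
```

A pass rate below 1.0 flags the candidate as functionally different from the reference on at least one sampled input; a pass rate of 1.0 is only statistical evidence of agreement, not a proof of equivalence.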
Findings
Effective control of code hallucination demonstrated
Generated unit tests improve evaluation precision
Selective generation increases safety and efficiency
Abstract
The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the executable nature of code. Accordingly, we propose a selective code generator that abstains from uncertain generations, based on the functional correctness evaluated by generated unit tests, to theoretically control the correctness among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm FuzzEval. We demonstrate the efficacy of our…
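The abstention mechanism described in the abstract can be sketched as a threshold on a confidence score, chosen on held-out calibration data so that the error rate among accepted generations stays below a target. This is a simplified illustration, not the paper's algorithm: it uses the raw empirical false discovery rate rather than a statistical bound, so it carries no guarantee, and the function names, scores, and labels below are all hypothetical.

```python
def choose_threshold(scores, correct, target_fdr):
    """Return the smallest score threshold such that, among calibration
    examples with score >= threshold, the fraction of incorrect ones
    (the empirical false discovery rate) is at most target_fdr.
    Returns None if no threshold satisfies the target."""
    pairs = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)
    best = None
    selected = 0
    wrong = 0
    i = 0
    while i < len(pairs):
        tau = pairs[i][0]
        # Add every example that shares this score, so ties are handled together.
        while i < len(pairs) and pairs[i][0] == tau:
            selected += 1
            if not pairs[i][1]:
                wrong += 1
            i += 1
        if wrong / selected <= target_fdr:
            best = tau
    return best


def selective_generate(code, score, threshold):
    """Return the generated code if its score clears the threshold,
    otherwise abstain by returning None."""
    if threshold is None or score < threshold:
        return None
    return code


# Hypothetical calibration data: a confidence score per generation and
# whether that generation passed its generated unit tests.
cal_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
cal_correct = [True, True, True, False, True, False]

tau = choose_threshold(cal_scores, cal_correct, target_fdr=0.25)
print("threshold:", tau)
print(selective_generate("def f(): ...", 0.85, tau))  # accepted
print(selective_generate("def g(): ...", 0.5, tau))   # abstains, prints None
```

Lowering the threshold accepts more generations at the cost of a higher error rate among them; the selection rule picks the most permissive threshold that still meets the target on the calibration set.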
Peer Reviews
Decision: Submitted to ICLR 2026
1. Its originality lies in defining α-code entailment (statistical code functional correctness via fuzzing) and extending selective prediction to code with SCG, bridging gaps between natural language entailment and code’s structural complexity. 2. It demonstrates high quality through rigorous theoretical proofs (e.g., FDR-CE controllability guarantees) and well-controlled experiments (5 LCMs, 4 datasets, 50 random splits) that report statistical significance. 3. It maintains clarity by explainin
Admittedly, I'm not an expert in this field, so I raise these weaknesses with low confidence. 1. It relies on an i.i.d. assumption for FDR-CE guarantees but does not explore mitigations for distribution shift (e.g., unseen code), limiting real-world applicability. 2. Low-quality models (e.g., CodeLlama 13B) fail to meet desired FDR-CE due to uncalibrated scoring functions, yet the paper does not propose strategies to improve SCG for such models. 3. It only briefly compares FuzzEval with L
This paper makes theoretical and methodological contributions to addressing code hallucination in large language models. The introduction of α-code entailment represents a novel and necessary formalization for measuring functional correctness between code snippets, addressing the challenge that code's unnatural structure makes it difficult for humans to verify functional equivalence at scale. Additionally, fuzzing tools, traditionally used for bug detection, are repurposed to automatically gener
1. The computational cost of fuzzing remains unanalyzed. While nₘₐₓ = 150 is specified, the paper provides no wall-clock time measurements, overhead comparisons relative to code generation time, or scalability analysis for larger codebases. The low selection efficiency for weaker models (CodeLlama achieves only ~7% efficiency in Figure 4a) suggests the method may be impractical below certain model quality thresholds. 2. The evaluation scope is limited to simple algorithmic problems in standalon
1. The paper is well-written and analyzes the interesting and important problem of code hallucination in LLMs. 2. It proposes a novel approach (SCG and FuzzEval) for automated unit test generation to evaluate code correctness. 3. The empirical validation is thorough, covering multiple datasets, models, and baselines.
Major: 1. The framework's core reliance on a ground-truth "canonical solution" for α-equivalence is a significant limitation, as such reference solutions are unavailable in most real-world code generation scenarios. 2. The accuracy of the proposed FDR-CE metric is highly dependent on the quality of the generated unit tests, but the impact of this quality (e.g., test coverage) is not quantified. 3. The evaluation benchmarks consist mainly of simple, stateless algorithmic problems, mak
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Formal Methods in Verification · Logic, programming, and type systems
Methods: Softmax · Attention Is All You Need
