Rethinking Verification for LLM Code Generation: From Generation to Testing

Zihan Ma; Taolin Zhang; Maosong Cao; Junnan Liu; Wenwei Zhang; Minnan Luo; Songyang Zhang; Kai Chen

arXiv:2507.06920·cs.CL·July 11, 2025

TL;DR

This paper critically examines how current benchmarks verify LLM-generated code, introduces SAGA, a human-LLM collaborative test-case generation method, and develops TCGBench to study test coverage and reliability in code generation assessment.

Contribution

It proposes a human-LLM collaborative test-case generation method (SAGA) and a new benchmark (TCGBench) to enhance test coverage and evaluation accuracy for LLM code generation.

Findings

01

SAGA achieves a 90.62% detection rate on TCGBench.

02

The verifier accuracy of the benchmark synthesized with SAGA is 10.78% higher than that of LiveCodeBench-v6.

03

Tests generated with SAGA make the evaluation of LLM-generated code more reliable.

Abstract

Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality…
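
The abstract's point about limited, homogeneous test cases letting subtle faults through can be made concrete with a small sketch. The toy problem (absolute value), the two buggy solutions, and the helper names `passes`, `detection_rate`, and `rejects_all` are all hypothetical and are not taken from the paper or its repository. The metric definitions are a simplified reading: detection rate is taken here as the fraction of known-wrong solutions that fail at least one test, and the stricter check asks whether a suite rejects every known-wrong solution.

```python
from typing import Callable, Sequence, Tuple

Solution = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def passes(solution: Solution, tests: Sequence[TestCase]) -> bool:
    """Return True if the solution gives the expected output on every test."""
    return all(solution(x) == expected for x, expected in tests)


def detection_rate(wrong: Sequence[Solution], tests: Sequence[TestCase]) -> float:
    """Fraction of known-wrong solutions that fail at least one test (assumed definition)."""
    caught = sum(not passes(s, tests) for s in wrong)
    return caught / len(wrong)


def rejects_all(wrong: Sequence[Solution], tests: Sequence[TestCase]) -> bool:
    """Stricter check: the suite must reject every known-wrong solution."""
    return all(not passes(s, tests) for s in wrong)


# Hypothetical buggy implementations of abs(x).
def forgets_negatives(x: int) -> int:
    return x  # wrong for every negative input


def off_by_one_boundary(x: int) -> int:
    return -x if x < -1 else x  # wrong only for x == -1


WRONG_SOLUTIONS = [forgets_negatives, off_by_one_boundary]

# A small, homogeneous suite versus one that also probes the boundary.
NARROW_TESTS = [(0, 0), (3, 3), (-5, 5)]
BROAD_TESTS = NARROW_TESTS + [(-1, 1)]

if __name__ == "__main__":
    print(detection_rate(WRONG_SOLUTIONS, NARROW_TESTS))  # 0.5
    print(detection_rate(WRONG_SOLUTIONS, BROAD_TESTS))   # 1.0
    print(rejects_all(WRONG_SOLUTIONS, NARROW_TESTS))     # False
    print(rejects_all(WRONG_SOLUTIONS, BROAD_TESTS))      # True
```

In this sketch the narrow suite lets the boundary bug pass, so a verifier built on it would accept a faulty solution; adding a single boundary case closes the gap.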

Code & Models

Repositories

open-compass/saga (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques

Methods: SAGA