ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, and Ming Li

TL;DR
This paper introduces ACES, a leave-one-out AUC-based method that scores LLM-generated tests by how well they distinguish correct from incorrect code, improving the selection of code candidates without requiring test correctness labels.
Contribution
The paper proposes ACES, an AUC consistency scoring method that breaks the circular dependency between test correctness and code correctness when ranking code candidates, achieving state-of-the-art Pass@k on multiple benchmarks.
Findings
ACES achieves state-of-the-art Pass@k on multiple benchmarks.
ACES variants operate with negligible overhead.
The leave-one-out AUC correlates with a test's ability to distinguish correct from incorrect code.
Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the…
