ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, and Ming Li

arXiv:2604.03922·cs.LG·April 7, 2026

TL;DR

This paper introduces ACES, a leave-one-out AUC-based method for scoring the LLM-generated tests used to select LLM-generated code candidates. It judges each test by whether it distinguishes correct from incorrect code, and it does not require test correctness labels.

Contribution

The paper proposes ACES, an AUC consistency scoring method that ranks code candidates while avoiding the circular dependency between test correctness and code correctness, and reports state-of-the-art Pass@k.

Findings

01

ACES achieves state-of-the-art Pass@k on multiple benchmarks.

02

ACES variants operate with negligible overhead.

03

The leave-one-out AUC correlates with a test's ability to distinguish correct from incorrect code.

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the…
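
The leave-one-out procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a binary pass matrix, takes the "aggregate score" to be an unweighted count of passes on the remaining tests, counts ties as half a win, and returns 0.5 when a held-out test is passed by all codes or by none. The function name `loo_auc` is hypothetical.

```python
import numpy as np

def loo_auc(passes):
    """Return one leave-one-out AUC per test.

    passes[i, j] is 1 if code candidate i passes test j, else 0.
    Assumption: the aggregate score of a code is its unweighted pass
    count over the remaining tests (the paper may weight tests differently).
    """
    passes = np.asarray(passes, dtype=int)
    n_codes, n_tests = passes.shape
    totals = passes.sum(axis=1)
    # Assumption: AUC defaults to 0.5 (uninformative) when undefined.
    scores = np.full(n_tests, 0.5)
    for j in range(n_tests):
        held_out = passes[:, j].astype(bool)
        rest = totals - passes[:, j]   # score on all tests except j
        pos = rest[held_out]           # codes that pass the held-out test
        neg = rest[~held_out]          # codes that fail the held-out test
        if pos.size == 0 or neg.size == 0:
            continue
        # AUC = fraction of (passing, failing) pairs in which the passing
        # code also ranks higher on the remaining tests; ties count 0.5.
        diff = pos[:, None] - neg[None, :]
        scores[j] = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
    return scores

# Toy example: tests 0 to 2 mostly agree; test 3 passes only the codes
# that the other tests rank lowest.
matrix = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])
print(loo_auc(matrix))
```

In this toy matrix the fourth test receives a LOO-AUC of 0, since every code it passes ranks below every code it fails on the remaining tests, while the first test receives 0.75. How ACES turns these scores into a final ranking of code candidates is not specified in the excerpt above.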
