AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
Pouya Pezeshkpour, Estevam Hruschka

TL;DR
AutoPyVerifier automatically synthesizes minimal, interpretable Python verifiers for LLM outputs, improving verification accuracy and providing structural checks across diverse benchmarks.
Contribution
The paper introduces a framework that uses an LLM to generate executable verifiers and refines them via search over a DAG, yielding compact and effective verifier sets.
Findings
AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points.
The learned verifier sets become more structural and semantically grounded.
Using the verifier set as an external tool boosts downstream accuracy by up to 17.0 points.
Abstract
Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable…
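To make the phrase "joint satisfaction closely matches that objective" concrete, here is a minimal sketch of how a set of deterministic Python verifiers could be scored against labels on a development set. It is not the authors' implementation. The two verifiers (`has_final_answer`, `answer_is_integer`), the "Answer:" output convention, and the toy data are assumptions made up for illustration, and the DAG search described in the abstract is not shown.

```python
from typing import Callable, Sequence

# A verifier is any deterministic function from an LLM output to pass/fail.
Verifier = Callable[[str], bool]


def has_final_answer(output: str) -> bool:
    """Hypothetical structural check: the output contains an 'Answer:' marker."""
    return "Answer:" in output


def answer_is_integer(output: str) -> bool:
    """Hypothetical check: the text after the last 'Answer:' is an integer."""
    if "Answer:" not in output:
        return False
    tail = output.rsplit("Answer:", 1)[1].strip()
    digits = tail[1:] if tail.startswith("-") else tail
    return digits.isdigit()


def jointly_satisfied(output: str, verifiers: Sequence[Verifier]) -> bool:
    """An output is accepted only if every verifier in the set passes."""
    return all(v(output) for v in verifiers)


def f1_against_labels(
    outputs: Sequence[str],
    labels: Sequence[bool],
    verifiers: Sequence[Verifier],
) -> float:
    """F1 of 'all verifiers pass' as a predictor of the target label."""
    preds = [jointly_satisfied(o, verifiers) for o in outputs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy development set (invented for this sketch): outputs plus correctness labels.
dev_outputs = [
    "Reasoning... Answer: 42",
    "Answer: forty-two",
    "I think it is 42",
    "Answer: -7",
]
dev_labels = [True, False, False, True]

f1_single = f1_against_labels(dev_outputs, dev_labels, [has_final_answer])
f1_pair = f1_against_labels(
    dev_outputs, dev_labels, [has_final_answer, answer_is_integer]
)
print(f1_single, f1_pair)
```

On this toy data the two-verifier set matches the labels more closely than the single structural check. Under these assumptions, that is the kind of signal a search over candidate verifier sets could optimize while keeping the set small.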
