When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
Young Hyun Cho, Will Wei Sun

TL;DR
This paper introduces an always-valid statistical wrapper for AI workflows that decides when to release an output from an iterative generate-evaluate-revise loop. The release decision stays statistically valid under repeated monitoring, with the aim of reducing premature incorrect releases.
Contribution
It proposes a release decision method that calibrates evaluator scores against a hard-negative reference pool and accumulates the resulting evidence with an e-process, providing finite-sample control for black-box generator-evaluator pipelines.
Findings
Reduces premature incorrect releases in case studies.
Provides finite-sample control of release decisions.
Achieves reliable release on feasible tasks.
Abstract
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional…
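The pipeline described in the abstract (reference pool, score calibration, e-process accumulation) can be illustrated with a minimal sketch. This is not the authors' construction: the rank-based p-value, the calibrator e = kappa * p^(kappa - 1), the product-form wealth, and the 1/alpha release threshold are standard textbook choices swapped in here as assumptions, and all function and parameter names (`pool_p_value`, `p_to_e`, `release_round`, `kappa`, `alpha`) are hypothetical. The rank-based p-value in particular is only valid if a failing candidate's score behaves like a draw from the pool, which is an assumption of this sketch and may differ from how the paper obtains conservative evidence.

```python
from typing import Iterable, Optional, Sequence


def pool_p_value(score: float, reference_pool: Sequence[float]) -> float:
    """Rank-based p-value of `score` against a pool of high-scoring failures.

    Assumption of this sketch: under the null ("the candidate is a failure"),
    its score looks like one more draw from the reference pool.
    """
    n_at_least = sum(1 for r in reference_pool if r >= score)
    return (1 + n_at_least) / (len(reference_pool) + 1)


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Standard p-to-e calibrator e = kappa * p**(kappa - 1), with 0 < kappa < 1."""
    return kappa * p ** (kappa - 1.0)


def release_round(
    scores: Iterable[float],
    reference_pool: Sequence[float],
    alpha: float = 0.05,
    kappa: float = 0.5,
) -> Optional[int]:
    """Return the first round (1-indexed) at which release is triggered, else None.

    Wealth is the running product of per-round e-values. Releasing once wealth
    reaches 1/alpha is the usual Ville-inequality threshold for a test
    supermartingale; whether the paper uses this exact rule is not stated
    in the abstract.
    """
    wealth = 1.0
    for t, s in enumerate(scores, start=1):
        wealth *= p_to_e(pool_p_value(s, reference_pool), kappa)
        if wealth >= 1.0 / alpha:
            return t
    return None
```

In this sketch a candidate that scores above every pooled failure yields a small p-value and an e-value above 1, so wealth grows across rounds; a candidate that scores no better than the pool yields an e-value below 1 and wealth shrinks, so release is withheld. A larger reference pool allows smaller p-values and therefore faster release on tasks where the evaluator clearly separates successes from hard negatives.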
