When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

Young Hyun Cho; Will Wei Sun

arXiv:2605.12947·stat.ML·May 14, 2026

TL;DR

This paper introduces an always-valid statistical wrapper that decides when an AI workflow should release the output of an iterative generate-evaluate-revise loop. The wrapper stays valid under repeated monitoring of evaluator scores and is aimed at reducing premature incorrect releases.

Contribution

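It proposes a release decision method that calibrates black-box evaluator scores against a hard-negative reference pool and accumulates the resulting evidence with an e-process. This gives finite-sample control of release decisions without relying on a likelihood model or exchangeability assumptions.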

Findings

01 Reduces premature incorrect releases in case studies.

02 Provides finite-sample control of release decisions.

03 Achieves reliable release on feasible tasks.

Abstract

LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or exchangeability assumptions typically used for calibration are unavailable. We propose an always-valid release wrapper for existing generator-evaluator pipelines. The wrapper builds a hard-negative reference pool of high-scoring failures, calibrates deployment-time evaluator scores against this pool, and accumulates the resulting evidence with an e-process. This separates two roles: the reference pool turns black-box scores into conservative evidence, while the e-process provides validity under optional…
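
The pipeline described in the abstract (reference pool, score calibration, e-process accumulation) can be illustrated with a minimal Python sketch. This is not the authors' construction: the abstract does not specify the calibration or the e-value form, so the sketch assumes a rank-based p-value against the pool, a standard power calibrator e = kappa * p^(kappa - 1), a product e-process, and a release threshold of 1/alpha. The function names and the parameters `alpha` and `kappa` are hypothetical.

```python
from bisect import bisect_left


def pool_p_value(score, sorted_pool):
    """Rank-based p-value of a score against a sorted reference pool.

    Assumption: the pool holds evaluator scores of known failures, so a
    small p-value means the candidate outscores almost every failure.
    """
    n = len(sorted_pool)
    n_at_least = n - bisect_left(sorted_pool, score)  # pool scores >= score
    return (1 + n_at_least) / (n + 1)


def p_to_e(p, kappa=0.5):
    """Power calibrator: kappa * p**(kappa - 1) integrates to 1 on (0, 1]."""
    return kappa * p ** (kappa - 1.0)


def release_time(scores, reference_pool, alpha=0.05, kappa=0.5):
    """Return (iteration, wealth) at first release, or (None, wealth).

    Assumption: evidence is accumulated as a running product of e-values,
    and the workflow releases once the product reaches 1 / alpha.
    """
    pool = sorted(reference_pool)
    threshold = 1.0 / alpha
    wealth = 1.0
    for t, score in enumerate(scores, start=1):
        wealth *= p_to_e(pool_p_value(score, pool), kappa)
        if wealth >= threshold:
            return t, wealth
    return None, wealth


if __name__ == "__main__":
    # Hypothetical pool of 99 failure scores spread over (0, 1).
    pool = [i / 100 for i in range(1, 100)]
    print(release_time([0.60, 0.995, 0.995, 0.995], pool))
    print(release_time([0.50] * 10, pool))
```

In this sketch a candidate that only matches typical failure scores shrinks the accumulated evidence, so the loop keeps revising instead of releasing. Whether the product remains a valid e-process depends on conditions the abstract does not spell out, so the threshold rule here is illustrative only.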
