Adaptive auditing of AI systems with anytime-valid guarantees

Siyu Zhou; Patrick Vossler; Venkatesh Sivaraman; Yifan Mai; Jean Feng

arXiv:2605.07002·cs.AI·May 11, 2026

TL;DR

This paper introduces a statistically rigorous framework for adaptive auditing of AI systems using anytime-valid inference, enabling reliable conclusions from limited, adaptively sampled data.

Contribution

It develops a novel hypothesis testing approach based on Safe Anytime-Valid Inference, addressing challenges of adaptive testing in AI system evaluation.

Findings

01. Maintains anytime-valid type-I error control in adaptive audits.

02. Outperforms pre-specified testing methods in empirical evaluations.

03. Reaches rigorous conclusions with as few as 20 observations.

Abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases to annotate, and how many, based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases), and decisions about sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance…
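To make the idea of anytime-valid testing under optional stopping concrete, here is a minimal sketch of a generic betting-style test martingale (an e-process) in Python. It is not the paper's procedure. It assumes a simplified null in which each audited case fails with probability at most `p0` given everything observed so far, and the function name `run_betting_audit` and the parameters `p0`, `alpha`, and `bet` are hypothetical names chosen for illustration. Under that assumed null the wealth process is a nonnegative supermartingale, so by Ville's inequality the chance that it ever reaches `1 / alpha` is at most `alpha`, no matter when the auditor decides to stop.

```python
from typing import Iterable, Optional, Tuple


def run_betting_audit(
    outcomes: Iterable[int],
    p0: float,
    alpha: float = 0.05,
    bet: float = 1.0,
) -> Tuple[Optional[int], float]:
    """Sequentially test H0: failure rate <= p0 using a fixed-bet e-process.

    Illustrative sketch only; all names are hypothetical.
    Returns (rejection_time or None, final wealth).
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / p0:
        raise ValueError("bet must lie in [0, 1/p0) so wealth stays positive")

    wealth = 1.0
    threshold = 1.0 / alpha  # Ville's inequality: P(sup wealth >= 1/alpha) <= alpha
    for t, x in enumerate(outcomes, start=1):
        if x not in (0, 1):
            raise ValueError("outcomes must be 0 (pass) or 1 (failure)")
        # Wealth grows on a failure and shrinks on a pass; its conditional
        # expectation is at most the current wealth when P(failure) <= p0.
        wealth *= 1.0 + bet * (x - p0)
        if wealth >= threshold:
            return t, wealth  # reject H0 at observation t
    return None, wealth  # evidence never crossed the threshold


# Usage: four consecutive failures against an assumed null rate of 10%.
stop, wealth = run_betting_audit([1, 1, 1, 1], p0=0.1, bet=2.0)
print(stop, wealth)
```

In this sketch the bet size is fixed in advance for simplicity. A bet chosen from past observations only would keep the same guarantee, which is what lets the evidence be monitored after every annotation rather than at a pre-specified sample size.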
