Adaptive auditing of AI systems with anytime-valid guarantees
Siyu Zhou, Patrick Vossler, Venkatesh Sivaraman, Yifan Mai, Jean Feng

TL;DR
This paper introduces a statistically rigorous framework for adaptive auditing of AI systems using anytime-valid inference, enabling reliable conclusions from limited, adaptively sampled data.
Contribution
It develops a novel hypothesis testing approach based on Safe Anytime-Valid Inference, addressing challenges of adaptive testing in AI system evaluation.
Findings
Maintains anytime-valid type-I error control in adaptive audits.
Outperforms pre-specified testing methods in empirical evaluations.
Reaches rigorous conclusions with as few as 20 observations.
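To make the idea of anytime-valid type-I error control concrete, here is a minimal sketch of a generic betting-style test martingale for a binary failure indicator. It is not the paper's procedure: the function names (`betting_e_process`, `first_rejection`), the fixed bet size, and the null "failure rate is at most p0" are all assumptions chosen for illustration. Under that null, the running wealth is a nonnegative supermartingale, so by Ville's inequality the probability that it ever reaches 1/alpha is at most alpha, regardless of when the auditor decides to stop.

```python
def betting_e_process(outcomes, p0, bet=0.5):
    """Running wealth of a fixed-bet test martingale (illustrative sketch).

    outcomes: iterable of 0/1 values, where 1 marks an observed failure.
    p0:       hypothetical null bound on the failure rate (H0: rate <= p0).
    bet:      fraction in [0, 1] of the largest admissible bet, 1 / p0.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if not 0.0 <= bet <= 1.0:
        raise ValueError("bet must lie in [0, 1]")
    lam = bet / p0  # lam in [0, 1/p0] keeps every factor nonnegative
    wealth = 1.0
    path = []
    for x in outcomes:
        # Expected factor under H0 is 1 + lam * (E[x] - p0) <= 1.
        wealth *= 1.0 + lam * (x - p0)
        path.append(wealth)
    return path


def first_rejection(path, alpha=0.05):
    """Return the 1-based index at which wealth first reaches 1/alpha, else None."""
    threshold = 1.0 / alpha
    for t, wealth in enumerate(path, start=1):
        if wealth >= threshold:
            return t
    return None


if __name__ == "__main__":
    audit = [1, 1, 1, 1]  # made-up outcomes: four failures in a row
    path = betting_e_process(audit, p0=0.2)
    print(first_rejection(path, alpha=0.05))
```

Because the guarantee holds at every time step simultaneously, the auditor may inspect the wealth after each annotation and stop whenever it is convenient, which is the property that fixed-sample tests lack. A fixed bet is the simplest choice; adaptive bet sizes that depend only on past outcomes preserve the same guarantee.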
Abstract
A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance…
