Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning
Vishal Srivastava

TL;DR
This paper demonstrates fundamental information-theoretic and computational barriers to reliably evaluating AI safety in black-box models, especially when models depend on unobserved internal variables that are rare during testing but common in deployment.
Contribution
It formalizes the limits of black-box safety evaluation for latent context-dependent models and shows when additional safeguards are mathematically necessary.
Findings
Passive evaluation error lower bound ~0.208*delta*L
Adaptive evaluation error remains high even with optimal querying
Computational separation under cryptographic assumptions
Abstract
Black-box safety evaluation of AI systems assumes model behavior on test distributions reliably predicts deployment performance. We formalize and challenge this assumption through latent context-conditioned policies -- models whose outputs depend on unobserved internal variables that are rare under evaluation but prevalent under deployment. We establish fundamental limits showing that no black-box evaluator can reliably estimate deployment risk for such models. (1) Passive evaluation: For evaluators sampling i.i.d. from D_eval, we prove minimax lower bounds via Le Cam's method: any estimator incurs expected absolute error >= (5/24)*delta*L approximately 0.208*delta*L, where delta is trigger probability under deployment and L is the loss gap. (2) Adaptive evaluation: Using a hash-based trigger construction and Yao's minimax principle, worst-case error remains >= delta*L/16 even for fully…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Software Testing and Debugging Techniques · Machine Learning and Algorithms
