Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning

Vishal Srivastava

arXiv:2602.16984·cs.AI·February 20, 2026

Open Access

TL;DR

This paper demonstrates fundamental information-theoretic and computational barriers to reliably evaluating AI safety in black-box models, especially when models depend on unobserved internal variables that are rare during testing but common in deployment.

Contribution

It formalizes the limits of black-box safety evaluation for latent context-dependent models and shows when additional safeguards are mathematically necessary.

Findings

01 Passive evaluation: any estimator incurs expected absolute error of at least (5/24)*delta*L, approximately 0.208*delta*L.

02 Adaptive evaluation: worst-case error remains at least delta*L/16 even with optimal querying.

03 Computational separation under cryptographic assumptions.

Abstract

Black-box safety evaluation of AI systems assumes that model behavior on test distributions reliably predicts deployment performance. We formalize and challenge this assumption through latent context-conditioned policies: models whose outputs depend on unobserved internal variables that are rare under evaluation but prevalent under deployment. We establish fundamental limits showing that no black-box evaluator can reliably estimate deployment risk for such models. (1) Passive evaluation: For evaluators sampling i.i.d. from D_eval, we prove minimax lower bounds via Le Cam's method: any estimator incurs expected absolute error >= (5/24)*delta*L (approximately 0.208*delta*L), where delta is the trigger probability under deployment and L is the loss gap. (2) Adaptive evaluation: Using a hash-based trigger construction and Yao's minimax principle, worst-case error remains >= delta*L/16 even for fully…
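To get a feel for the magnitudes in the abstract, the following is a minimal Python sketch, not the paper's construction or proof. It evaluates the two quoted lower bounds, (5/24)*delta*L and delta*L/16, and illustrates the intuition behind the passive case: if the trigger fires with a small probability under D_eval, an i.i.d. evaluator will often never observe it. The function names, the parameter `eps_eval` (trigger probability under evaluation), and the naive empirical estimator are all assumptions introduced here for illustration; the abstract only says the trigger is "rare under evaluation".

```python
import random


def passive_lower_bound(delta: float, loss_gap: float) -> float:
    """Passive bound quoted in the abstract: (5/24) * delta * L."""
    return (5.0 / 24.0) * delta * loss_gap


def adaptive_lower_bound(delta: float, loss_gap: float) -> float:
    """Adaptive bound quoted in the abstract: delta * L / 16."""
    return delta * loss_gap / 16.0


def prob_no_trigger_seen(eps_eval: float, num_samples: int) -> float:
    """Probability that num_samples i.i.d. evaluation draws never hit the trigger.

    eps_eval is a hypothetical per-sample trigger probability under D_eval.
    """
    return (1.0 - eps_eval) ** num_samples


def naive_eval_estimate(eps_eval: float, num_samples: int,
                        loss_gap: float, seed: int = 0) -> float:
    """Empirical extra loss seen by a passive evaluator (illustrative only).

    Each draw triggers with probability eps_eval and then contributes loss_gap.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < eps_eval for _ in range(num_samples))
    return loss_gap * hits / num_samples


if __name__ == "__main__":
    delta, loss_gap = 0.1, 1.0        # assumed deployment trigger rate and loss gap
    eps_eval, num_samples = 1e-6, 10_000  # assumed evaluation-time values

    print("passive bound :", passive_lower_bound(delta, loss_gap))
    print("adaptive bound:", adaptive_lower_bound(delta, loss_gap))
    print("P(no trigger in eval):", prob_no_trigger_seen(eps_eval, num_samples))
    print("naive eval estimate  :",
          naive_eval_estimate(eps_eval, num_samples, loss_gap))
    print("extra deployment risk:", delta * loss_gap)
```

With these assumed values the evaluator almost never sees a triggered sample, so its estimate sits near zero while the extra deployment risk is delta*L. This gap is the kind of error the quoted bounds say cannot be driven to zero by a black-box evaluator.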

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques · Machine Learning and Algorithms