Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention
Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh

TL;DR
This paper demonstrates that high offline accuracy of LLM critics does not guarantee safe interventions at deployment, and introduces a pre-deployment test to predict intervention outcomes and prevent performance regressions.
Contribution
It shows that intervention effects vary widely across models even when the critic is accurate, and proposes a small-scale pre-deployment test to assess whether intervention is safe.
Findings
High critic accuracy does not ensure safe intervention.
Interventions can both recover and disrupt task trajectories.
Pre-deployment testing can predict when intervention is beneficial or harmful.
Abstract
Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while leaving another nearly unchanged (near 0 pp). This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes:…
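The disruption-recovery tradeoff can be made concrete with a small accounting sketch. This is not the paper's test, which this excerpt does not describe; it is a minimal illustration that assumes each pilot task is run twice (once without the critic, once with intervention enabled) and that outcomes are binary. The names `PilotResult` and `estimate_net_effect` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    """Outcome of one pilot task under both conditions (hypothetical record)."""
    baseline_success: bool    # task outcome with no intervention
    intervened_success: bool  # outcome of the same task with critic intervention


def estimate_net_effect(results):
    """Count recovered and disrupted tasks and return the net change in pp.

    Assumption: recovery = failed at baseline but succeeded with intervention;
    disruption = succeeded at baseline but failed with intervention.
    """
    n = len(results)
    if n == 0:
        raise ValueError("pilot is empty")
    recovered = sum(
        1 for r in results if not r.baseline_success and r.intervened_success
    )
    disrupted = sum(
        1 for r in results if r.baseline_success and not r.intervened_success
    )
    # Net change in success rate, expressed in percentage points.
    net_pp = 100.0 * (recovered - disrupted) / n
    return {"recovered": recovered, "disrupted": disrupted, "net_pp": net_pp}


# Illustrative 50-task pilot with made-up outcomes (not data from the paper).
pilot = (
    [PilotResult(True, True)] * 25      # succeeded either way
    + [PilotResult(True, False)] * 5    # disrupted by intervention
    + [PilotResult(False, True)] * 2    # recovered by intervention
    + [PilotResult(False, False)] * 18  # failed either way
)
summary = estimate_net_effect(pilot)
print(summary)
```

In this made-up pilot, disruptions outnumber recoveries, so the net effect is negative even though some failing tasks were rescued. Because agent runs are stochastic, a 50-task estimate like this is noisy and is best read as a direction signal rather than a precise effect size.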
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Software System Performance and Reliability · Military Defense Systems Analysis
