Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Rakshith Vasudev; Melisa Russak; Dan Bikel; Waseem Alshikh

arXiv:2602.03338·cs.CL·February 4, 2026

TL;DR

This paper demonstrates that high offline accuracy of LLM critics does not guarantee safe interventions at deployment, and introduces a pre-deployment test to predict intervention outcomes and prevent performance regressions.

Contribution

It shows that the effect of critic-triggered intervention varies widely across models even when the critic is accurate offline, and it proposes a small pilot of 50 tasks as a pre-deployment test of whether intervention is likely to help or harm.

Findings

01

High critic accuracy does not ensure safe intervention.

02

Interventions can both recover and disrupt task trajectories.

03

Pre-deployment testing can predict when intervention is beneficial or harmful.

Abstract

Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes:…
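
The disruption-recovery tradeoff described in the abstract can be illustrated with simple arithmetic. The sketch below is not the paper's test; it is a minimal model that assumes two hypothetical end-to-end rates: `recovery_rate` (the fraction of otherwise-failing trajectories that succeed once intervention is enabled) and `disruption_rate` (the fraction of otherwise-successful trajectories that fail once intervention is enabled). The function names and all numbers are illustrative assumptions.

```python
def net_success_change(base_success_rate: float,
                       recovery_rate: float,
                       disruption_rate: float) -> float:
    """Expected change in success rate when intervention is enabled.

    Assumed model (not taken from the paper): gains come only from
    recovering would-be failures, and losses come only from disrupting
    would-be successes.
    """
    failing_share = 1.0 - base_success_rate
    recovered = failing_share * recovery_rate
    disrupted = base_success_rate * disruption_rate
    return recovered - disrupted


def break_even_success_rate(recovery_rate: float,
                            disruption_rate: float) -> float:
    """Baseline success rate at which gains and losses cancel.

    Under the assumed model, intervention helps only when the baseline
    success rate is below this value.
    """
    return recovery_rate / (recovery_rate + disruption_rate)


if __name__ == "__main__":
    # Illustrative rates only: the same intervention behaviour applied
    # to a strong agent and to a weak agent.
    for base in (0.8, 0.3):
        delta = net_success_change(base, recovery_rate=0.2, disruption_rate=0.1)
        print(f"baseline {base:.0%}: net change {delta * 100:+.1f} pp")
```

Under these assumptions the same recovery and disruption rates produce a loss for an agent that already succeeds on most tasks and a gain for one that mostly fails, which is one way an accurate critic could hurt one model while helping or barely affecting another.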

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software System Performance and Reliability · Military Defense Systems Analysis