Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
Mukund Pandey

TL;DR
This paper introduces PAEF, an evaluation framework for agentic AI systems running in production. It targets failure modes and output drift that existing lab-based benchmarks overlook.
Contribution
It presents a taxonomy of seven production-specific failure modes, empirically demonstrates the limitations of standard metrics, and proposes the open-source PAEF framework for continuous evaluation.
Findings
Standard metrics fail to detect four of the seven failure modes in the taxonomy.
Output drift occurs over multiple evaluation cycles.
The PAEF framework enables ongoing assessment of agentic AI in production.
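The drift finding can be made concrete with a small sketch. The code below is not PAEF and is not taken from the paper; it is one simple way, under stated assumptions, to quantify how far an agent's outputs move between evaluation cycles when the same prompts are replayed. The function names, the token-level Jaccard distance, and the 0.5 threshold are all illustrative choices.

```python
from typing import Dict, List, Optional


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0  # two empty outputs are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


def drift_per_cycle(cycles: List[Dict[str, str]]) -> List[float]:
    """Mean distance of each cycle's outputs from the first cycle's outputs.

    Each cycle maps a prompt ID to the agent's output for that prompt.
    Only prompts present in both the baseline and the cycle are compared.
    """
    if not cycles:
        raise ValueError("at least one evaluation cycle is required")
    baseline = cycles[0]
    scores = []
    for cycle in cycles:
        shared = [p for p in baseline if p in cycle]
        if not shared:
            raise ValueError("cycle shares no prompts with the baseline")
        total = sum(jaccard_distance(baseline[p], cycle[p]) for p in shared)
        scores.append(total / len(shared))
    return scores


def first_drifted_cycle(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first cycle whose mean distance exceeds the threshold."""
    for i, score in enumerate(scores):
        if score > threshold:
            return i
    return None


# Hypothetical replayed outputs for one prompt across three cycles.
cycles = [
    {"q1": "refund issued to customer"},
    {"q1": "refund issued to customer"},
    {"q1": "escalate ticket to human agent"},
]
scores = drift_per_cycle(cycles)
print(scores, first_drifted_cycle(scores))
```

A lexical distance like this only catches surface-level change. A production setup would likely swap in a semantic or task-level comparison, and would need a policy for choosing and refreshing the baseline cycle.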
Abstract
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation…
