E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Aviv Regev, Hanchen Wang

TL;DR
E-valuator is a statistical framework that transforms existing agent trajectory verifiers into reliable decision rules with controlled false alarm rates, improving the safety and efficiency of agent systems.
Contribution
E-valuator introduces a sequential hypothesis testing method that guarantees false alarm control for any black-box verifier used in agent trajectory evaluation.
Findings
Outperforms existing strategies in statistical power and false alarm control
Enables online monitoring and early termination of problematic trajectories
Provides a lightweight, model-agnostic approach for reliable agent evaluation
Abstract
Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they carry no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop…
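The abstract is cut off before it describes the construction, so the following is only a generic illustration of how an e-process can turn per-step verifier scores into a sequential decision rule. It is not the paper's algorithm. It assumes that the null hypothesis is "the trajectory is successful" (so a false alarm means flagging a successful trajectory), that scores are independent across steps, and that the score densities under success and failure are known Gaussians. All function names and distribution parameters are hypothetical. Under these assumptions the running product of likelihood ratios is a nonnegative martingale starting at 1 under the null, so by Ville's inequality the probability that it ever reaches 1/alpha is at most alpha.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical score models (assumptions for illustration only):
# verifier scores cluster near 0.8 on successful trajectories
# and near 0.4 on unsuccessful ones.
def pdf_success(score):
    return gaussian_pdf(score, 0.8, 0.15)


def pdf_failure(score):
    return gaussian_pdf(score, 0.4, 0.15)


def monitor_trajectory(scores, alpha):
    """Track a likelihood-ratio e-process over the per-step scores.

    Returns (step, e_value) for the first step at which the e-process
    reaches 1/alpha, or (None, final_e_value) if it never does.
    """
    threshold = 1.0 / alpha
    e_value = 1.0
    for step, score in enumerate(scores, start=1):
        # Evidence against the null "this trajectory is successful".
        e_value *= pdf_failure(score) / pdf_success(score)
        if e_value >= threshold:
            return step, e_value  # flag and stop early
    return None, e_value


def empirical_false_alarm_rate(alpha, n_trajectories, n_steps, seed=0):
    """Fraction of simulated successful trajectories that get flagged."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trajectories):
        scores = [rng.gauss(0.8, 0.15) for _ in range(n_steps)]
        step, _ = monitor_trajectory(scores, alpha)
        if step is not None:
            flagged += 1
    return flagged / n_trajectories
```

Calling `monitor_trajectory(scores, alpha=0.05)` on a stream of scores flags the trajectory as soon as the accumulated evidence reaches 20, which is what allows early termination. In practice the two score densities would have to be estimated from data rather than assumed, and a guarantee of this kind holds only to the extent that the null model is correct.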
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · AI-based Problem Solving and Planning
