E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Shuvom Sadhuka; Drew Prinster; Clara Fannjiang; Gabriele Scalia; Aviv Regev; Hanchen Wang

arXiv:2512.03109 · cs.LG · December 4, 2025

TL;DR

E-valuator is a statistical framework that transforms existing agent trajectory verifiers into reliable decision rules with controlled false alarm rates, improving the safety and efficiency of agent systems.

Contribution

The paper introduces a sequential hypothesis testing method that provides provable false alarm control for any black-box verifier used in agent trajectory evaluation.

Findings

1. Outperforms existing strategies in statistical power and false alarm control
2. Enables online monitoring and early termination of problematic trajectories
3. Provides a lightweight, model-agnostic approach for reliable agent evaluation

Abstract

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they carry no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop…
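
The abstract is cut off before it describes the construction, so the following is only a minimal sketch of the general idea behind sequential testing with an e-process, not the paper's algorithm. It assumes the null hypothesis is "the trajectory is successful", so that a false alarm means flagging a successful trajectory. It also assumes per-step verifier scores are independent Gaussians with illustrative parameters. The names `gaussian_pdf` and `monitor`, and all numeric values, are hypothetical. The running product of likelihood ratios is a nonnegative martingale under the null, so by Ville's inequality it exceeds 1/alpha at any step with probability at most alpha.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def monitor(scores, alpha=0.05, success_mean=0.7, failure_mean=0.4, std=0.15):
    """Flag a trajectory as unsuccessful once the e-process reaches 1/alpha.

    Assumption (not from the paper): scores of successful trajectories are
    N(success_mean, std^2) and scores of unsuccessful ones are
    N(failure_mean, std^2), independent across steps.

    Returns (step, wealth): the 1-indexed step at which the trajectory was
    flagged, or None if it was never flagged, plus the final e-process value.
    """
    threshold = 1.0 / alpha  # Ville's inequality gives false alarm rate <= alpha
    wealth = 1.0             # the e-process starts at 1
    for step, score in enumerate(scores, start=1):
        # Multiply in the likelihood ratio of "unsuccessful" vs "successful".
        wealth *= gaussian_pdf(score, failure_mean, std) / gaussian_pdf(
            score, success_mean, std
        )
        if wealth >= threshold:
            return step, wealth  # early termination: enough evidence of failure
    return None, wealth


# Low scores accumulate evidence quickly; high scores shrink the e-process.
flagged_step, _ = monitor([0.4, 0.4, 0.4, 0.4, 0.4])
never_flagged, _ = monitor([0.7, 0.7, 0.7, 0.7, 0.7])
```

Because the threshold is checked after every step, the same rule can be applied online and can stop a trajectory as soon as the evidence crosses 1/alpha, without fixing the trajectory length in advance.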

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · AI-based Problem Solving and Planning