Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Bhaskar Gurram

arXiv:2604.16706·cs.AI·April 21, 2026

TL;DR

This paper introduces AgentProp-Bench, a 2,000-task benchmark for tool-using LLM agents spanning four domains and nine production LLMs. It measures judge reliability, error propagation, and a runtime mitigation against a 100-label human-validated subset.

Contribution

It contributes a human-validated benchmark, together with measurements of how reliable automated judges are, how injected parameter errors propagate to final answers, and how well a runtime mitigation works for tool-using language agents.

Findings

01

Substring judging aligns poorly with human annotation (kappa=0.049).

02

A three-LLM ensemble improves judging agreement (kappa=0.432).

03

Runtime mitigation reduces hallucinations significantly on GPT-4o-mini.

Abstract

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman…
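The agreement figures above can be made concrete with a small sketch. It assumes the reported kappa is Cohen's kappa between one automated judge (or ensemble verdict) and the human labels, with binary correct/incorrect labels; the abstract does not state either detail. The `majority_vote` helper is a hypothetical way to combine three judges and is not confirmed as the paper's ensemble rule. The labels in the usage note are made up for illustration, not benchmark data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def majority_vote(votes: Sequence[Sequence[int]]) -> list[int]:
    """Per-item majority over binary judges (hypothetical ensemble rule).

    `votes` holds one label sequence per judge; use an odd number of judges.
    """
    return [1 if sum(item) * 2 > len(item) else 0 for item in zip(*votes)]
```

For example, with illustrative human labels `[1, 1, 0, 0, 1, 0, 1, 0]` and judge labels `[1, 1, 0, 0, 1, 0, 0, 1]`, observed agreement is 0.75 and chance agreement is 0.5, so `cohens_kappa` returns 0.5. A value near 0, such as the 0.049 reported for substring judging, means the judge agrees with humans barely more often than chance would predict.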

Code & Models

Repositories

bhaskargurram-ai/agenthallu-bench (GitHub)
