TL;DR
This paper introduces AgentProp-Bench, a benchmark for evaluating tool-using LLM agents. It measures how well automated judges agree with human annotation, characterizes how parameter-level errors propagate to final answers, and evaluates a runtime mitigation.
Contribution
It provides a 2,000-task benchmark with 2,300 traces across four domains and nine production LLMs, including a 100-label human-validated subset, along with findings on judge reliability, error propagation, and mitigation for tool-using language agents.
Findings
Substring-based judging agrees with human annotation only at chance level (kappa=0.049).
A three-LLM ensemble judge reaches moderate agreement with human annotation (kappa=0.432), with a conservative bias.
Runtime mitigation reduces hallucinations significantly on GPT-4o-mini.
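For readers unfamiliar with the kappa statistic quoted above, the following is a minimal sketch of how chance-corrected agreement between a judge and human labels can be computed. It assumes the standard Cohen's kappa for two raters and a simple majority vote to combine three judges; the summary does not state which kappa variant or aggregation rule the paper uses, and the function names and toy labels here are hypothetical, not taken from the benchmark.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def majority_vote(judge_votes):
    """Combine per-judge label lists item by item (assumed aggregation rule)."""
    return [Counter(item).most_common(1)[0][0] for item in zip(*judge_votes)]


# Toy labels for illustration only (1 = correct, 0 = incorrect).
human = [1, 1, 0, 0, 1, 0, 1, 0]
judge = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(human, judge)

ensemble = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```

A kappa near 0 means the judge agrees with humans no more often than chance would predict, while 1 means perfect agreement. With an odd number of binary judges, the majority vote never ties.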
Abstract
Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman…
