Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang; Chaitanya Kulkarni; Alec Chiu; Avinash Thangali; Zijie Pan; Shivani Shekhar; Yirou Ge; Yixi Li; Uma Kona; Linsey Pang; Prakhar Mehrotra

arXiv:2602.16246·cs.AI·May 14, 2026

TL;DR

This paper introduces Proxy State-Based Evaluation, a scalable LLM-driven simulation framework for benchmarking multi-turn tool-using LLM agents, providing reliable, on-policy evaluation without costly deterministic backends.

Contribution

The paper proposes a proxy state-based framework that enables scalable, reliable evaluation of LLM agents in multi-turn interactions without requiring a deterministic database backend.

Findings

01. Produces stable, model-differentiating rankings.

02. Transfers supervision to unseen scenarios.

03. Achieves over 90% human-LLM judge agreement.

Abstract

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks, such as tau-bench, tau^2-bench, and AppWorld, rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces…
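
To make the moving parts in the abstract concrete, the following is a minimal Python sketch of how a scenario, a state tracker, and a final-state check could fit together. It is not the authors' implementation. Every name here (`Scenario`, `stub_state_tracker`, `mismatched_fields`, `evaluate`, and the `state_update` key on tool turns) is hypothetical. Two simplifications are assumed: the LLM state tracker is replaced by a stub that merges updates reported by tool turns, and the LLM judge is replaced by a plain field-by-field comparison against the expected final state. Hallucination detection is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Scenario:
    """Hypothetical container for the scenario fields named in the abstract."""
    user_goal: str
    user_facts: dict[str, Any]
    system_facts: dict[str, Any]
    expected_final_state: dict[str, Any]
    expected_agent_behavior: list[str] = field(default_factory=list)


Trace = list[dict[str, Any]]
# A tracker maps the full interaction trace to a structured proxy state.
StateTracker = Callable[[Trace], dict[str, Any]]


def stub_state_tracker(trace: Trace) -> dict[str, Any]:
    """Stand-in for the LLM state tracker (assumption, not the paper's method).

    It merges the hypothetical 'state_update' dict of each tool turn, in order,
    so later updates overwrite earlier ones.
    """
    state: dict[str, Any] = {}
    for turn in trace:
        if turn.get("role") == "tool":
            state.update(turn.get("state_update", {}))
    return state


def mismatched_fields(
    proxy_state: dict[str, Any], expected: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {field: (expected, actual)} for every expected field that differs."""
    return {
        key: (want, proxy_state.get(key))
        for key, want in expected.items()
        if key not in proxy_state or proxy_state[key] != want
    }


def evaluate(
    scenario: Scenario, trace: Trace, tracker: StateTracker = stub_state_tracker
) -> dict[str, Any]:
    """Infer a proxy state from the trace and compare it with the expected state.

    The exact-match comparison is a deterministic stand-in for the LLM judge.
    """
    proxy_state = tracker(trace)
    mismatches = mismatched_fields(proxy_state, scenario.expected_final_state)
    return {
        "proxy_state": proxy_state,
        "goal_completed": not mismatches,
        "mismatches": mismatches,
    }
```

In this sketch the tracker is passed in as a callable, so a model-backed tracker could replace the stub without changing `evaluate`. Fields in the proxy state that the scenario does not mention are ignored by the comparison.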
