AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Kevin Zhu, Sunishchal Dev, Vasu Sharma, Ahan M R

TL;DR
AgentChangeBench is a new benchmark for evaluating how well conversational AI agents adapt to mid-dialogue goal shifts, emphasizing robustness, efficiency, and recovery in realistic enterprise scenarios.
Contribution
It introduces a comprehensive evaluation framework with four metrics and a large dataset to assess goal-shift robustness in tool-augmented language models.
Findings
GPT-4o achieves 92.2% recovery rate on airline shifts
Gemini's recovery rate drops to 48.6% on the same task
Redundancy rates exceed 80% in retail tasks, indicating inefficiencies
Abstract
Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional scores: for example,…
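To make two of the metrics named in the abstract more concrete, here is a minimal Python sketch of how a redundancy rate and a recovery time could be computed from a logged tool-call trace. The abstract does not give the formulas, so both definitions are assumptions made for illustration: redundancy is taken as the share of calls that repeat an earlier (tool, arguments) pair, and recovery time as the number of turns between the goal shift and the first call to a tool that serves the new goal. The `ToolCall` record, the function names, and the tool names are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    """One logged tool invocation (hypothetical record layout)."""
    turn: int    # index of the dialogue turn that issued the call
    name: str    # tool name
    args: tuple  # hashable (key, value) pairs describing the arguments


def tool_call_redundancy_rate(calls: list[ToolCall]) -> float:
    """Share of calls that repeat an earlier (name, args) pair.

    Assumed definition, not the paper's formula.
    """
    if not calls:
        return 0.0
    seen = set()
    redundant = 0
    for call in calls:
        key = (call.name, call.args)
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant / len(calls)


def goal_shift_recovery_turns(
    shift_turn: int, calls: list[ToolCall], new_goal_tools: set[str]
) -> Optional[int]:
    """Turns from the shift to the first call serving the new goal.

    Returns None when the agent never calls such a tool (no recovery).
    Assumed definition, not the paper's formula.
    """
    for call in sorted(calls, key=lambda c: c.turn):
        if call.turn >= shift_turn and call.name in new_goal_tools:
            return call.turn - shift_turn
    return None


# Hypothetical airline trace: the user switches from rebooking to
# cancelling at turn 4.
trace = [
    ToolCall(1, "get_booking", (("id", "A1"),)),
    ToolCall(2, "get_booking", (("id", "A1"),)),  # repeats the first call
    ToolCall(3, "search_flights", (("dest", "SFO"),)),
    ToolCall(5, "cancel_booking", (("id", "A1"),)),
]

tcrr = tool_call_redundancy_rate(trace)
gsrt = goal_shift_recovery_turns(4, trace, {"cancel_booking"})
```

In this trace one of four calls is a repeat, so the assumed redundancy rate is 0.25, and the first call serving the new goal arrives one turn after the shift. Returning `None` for a missed recovery keeps failures separate from slow recoveries, which matters when a recovery rate and a recovery time are reported side by side.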
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
Novel and Relevant Contribution: The core idea—evaluating agents on their ability to handle mid-conversation goal shifts—is highly relevant and addresses a critical shortcoming in current evaluation paradigms. This focus directly improves the realism of agent assessment for enterprise deployment. Comprehensive and Systematic Benchmark Design: The benchmark is well-constructed, with substantial scale (590 tasks), coverage across multiple realistic domains, and a clear, declarative task schema. Th
Statistical Reporting and Interpretation: Lack of Statistical Significance: The results are presented as point estimates (e.g., TSR percentages) without any measures of variance or statistical significance testing (e.g., confidence intervals, p-values). Given that each task was run only 3 times, the stability of these metrics is unclear. Claims about model superiority (e.g., "Claude-3.7-Sonnet recovers fastest") would be significantly strengthened by statistical validation. Inconsistent Precisio
1. The motivation is both important and insightful, as in real human-agent interactions, users indeed tend to shift their goals. 2. The work is easy to follow, and the proposed benchmark provides a valuable foundation for future research in this area.
1. Motivation. The authors should elaborate further on the motivation for introducing persona. From my perspective, the main focus of this paper is on goal shifting, and the use of persona appears to serve the purpose of making the benchmark more realistic. A more detailed explanation of this design choice is necessary. 2. Experiments. The authors present one main experiment in the appendix; however, several questions remain. (1) How do open-source models such as the Qwen3-series and GPT-OSS per
1. Multi-Domain Coverage: The benchmark covers four major real-world domains (banking, airline, retail, education), with diverse task types that align with enterprise needs. 2. Multi-Turn Dialogue & Goal Shift: By emphasizing multi-turn tasks and explicit goal shifts, the benchmark simulates realistic business workflows where user needs evolve during the conversation. This is more challenging and meaningful than traditional single-turn tasks. 3. Detailed Persona Design: Although the personas are
1. Incremental Advancement: The benchmark is primarily an extension of existing tool-use benchmarks, adding goal shift and persona dimensions. It lacks exploration of more advanced topics such as autonomous agent planning, multi-agent collaboration, and automatic goal recognition. 2. Lack of Extreme Scenarios: The benchmark does not systematically test for edge cases such as security boundaries, exception flows, or adversarial attacks. 3. Inconsistent Task Counts: The paper inconsistently report
The paper correctly observes that other established benchmarks focus on static goals within a conversation and follow a golden path towards that goal. This is unrealistic for many real-life applications. The paper proposes a much more realistic benchmark.
1. It is unusual for a benchmark paper not to show any results using existing models. Why did you put all the results in the Appendix? As is, it does not read like a paper. This is the biggest weakness. The Conclusions mention some experiments, but they do not exist in the main paper. I wondered whether I was reading a draft version of the paper. 2. On top of this, I would have preferred at least a baseline approach from the authors for tackling goal shifts during the conversation.
Taxonomy
Topics: Topic Modeling · Personal Information Management and User Behavior · AI in Service Interactions
