How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $\tau$-bench
Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral

TL;DR
This paper introduces IRMA, a framework that reformulates user inputs with domain rules and tool suggestions, significantly improving tool usage accuracy of large language models in complex, dynamic environments like $\tau$-bench.
Contribution
The paper presents IRMA, an automatic input reformulation method that enhances LLM tool use accuracy in multi-turn, dynamic environments, outperforming existing approaches.
Findings
IRMA outperforms ReAct, Function Calling, and Self-Reflection in overall pass scores.
Input reformulation improves reasoning and decision-making in LLM agents.
The approach increases reliability and consistency in tool usage in complex environments.
| Model | Method | $\tau$-Retail | $\tau$-Airline | Overall |
|---|---|---|---|---|
| Open-Source Models | | | | |
| Qwen 2.5 32B | ReAct | 24.4 | 25.0 | 24.7 |
| Llama 3.1 70B | ReAct | 50.4 | 26.0 | 38.2 |
| DeepSeek v3¹ | ReAct | 58.3 | 22.8 | 40.6 |
| Phi-4 14B | ReAct | 32.2 | 28.0 | 30.1 |
| Closed-Source Models | | | | |
| Gemini 1.5 Pro¹ | FC | 54.9 | 25.2 | 40.1 |
| Claude 3.5 Haiku² | FC | 51.0 | 22.8 | 36.9 |
| Claude 3.5 Sonnet² | FC | 62.6 | 36.0 | 49.3 |
| gpt-4o | FC | 60.5 | 42.4 | 51.4 |
| gpt-4o | ReAct | 51.8 | 39.6 | 45.7 |
| gpt-4o | SR | 51.1 | 44.8 | 47.9 |
| gpt-4o (ours) | IRMA | 58.3 | 45.2 | 51.75 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.396 | 0.2779 | 0.2279 | 0.200 | 0.180 |
| IRMA | 0.452 | 0.3680 | 0.3280 | 0.308 | 0.300 |
| FC | 0.424 | 0.3120 | 0.2660 | 0.232 | 0.200 |
| Self-reflection | 0.448 | 0.3140 | 0.2560 | 0.224 | 0.200 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.4941 | 0.3735 | 0.3206 | 0.2882 | 0.2647 |
| IRMA | 0.5706 | 0.4912 | 0.4471 | 0.4235 | 0.4118 |
| FC | 0.5529 | 0.4353 | 0.3794 | 0.3353 | 0.2941 |
| Self reflection | 0.5167 | 0.3750 | 0.3139 | 0.2778 | 0.2500 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.5226 | 0.4065 | 0.3516 | 0.3161 | 0.2903 |
| IRMA | 0.6258 | 0.5387 | 0.4903 | 0.4645 | 0.4516 |
| FC | 0.6000 | 0.4774 | 0.4161 | 0.3677 | 0.3226 |
| Self reflection | 0.5556 | 0.4146 | 0.3528 | 0.3111 | 0.2778 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.5182 | 0.3704 | 0.2999 | 0.2573 | 0.2260 |
| IRMA | 0.5826 | 0.4783 | 0.4261 | 0.3948 | 0.3739 |
| FC | 0.6052 | 0.4522 | 0.3643 | 0.3043 | 0.2609 |
| Self-reflection | 0.5113 | 0.3809 | 0.3017 | 0.2383 | 0.1826 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.5304 | 0.3804 | 0.3080 | 0.2643 | 0.2321 |
| IRMA | 0.5982 | 0.4911 | 0.4375 | 0.4054 | 0.3839 |
| FC | 0.6164 | 0.4616 | 0.3732 | 0.3125 | 0.2679 |
| self-reflection | 0.5250 | 0.3911 | 0.3098 | 0.2446 | 0.1875 |
| Method | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.5562 | 0.4048 | 0.3200 | 0.2818 | 0.2476 |
| ours | 0.6248 | 0.5171 | 0.4629 | 0.4305 | 0.4095 |
| FC | 0.6381 | 0.4838 | 0.3933 | 0.3314 | 0.2857 |
| self reflection | 0.5562 | 0.4171 | 0.3305 | 0.2610 | 0.2000 |
| IRMA Ablations (↓) | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| M | 0.416 | 0.27 | 0.212 | 0.18 | 0.16 |
| C | 0.416 | 0.276 | 0.206 | 0.164 | 0.14 |
| T | 0.424 | 0.268 | 0.19 | 0.14 | 0.1 |
| M + C | 0.428 | 0.31 | 0.26 | 0.236 | 0.22 |
| M+ T | 0.448 | 0.294 | 0.214 | 0.16 | 0.12 |
| C + T | 0.38 | 0.264 | 0.212 | 0.18 | 0.16 |
| M + C + T | 0.452 | 0.368 | 0.328 | 0.308 | 0.3 |
| Method (↓) | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| ReAct | 0.188 | 0.09 | 0.052 | 0.032 | 0.02 |
| FC | 0.236 | 0.1179 | 0.062 | 0.036 | 0.02 |
| IRMA | 0.208 | 0.144 | 0.106 | 0.08 | 0.06 |
| Method (↓) | Pass^1 | Pass^2 | Pass^3 | Pass^4 | Pass^5 |
|---|---|---|---|---|---|
| IRMA (R) | 0.4 | 0.302 | 0.256 | 0.232 | 0.22 |
| IRMA (F) | 0.452 | 0.368 | 0.328 | 0.308 | 0.3 |
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $\tau$-bench
Venkatesh Mishra†∗ Amir Saeidi† Satyam Raj† Mutsumi Nakamura†
Jayanth Srinivasa‡ Gaowen Liu‡ Ali Payani‡ Chitta Baral†
†Arizona State University ‡Cisco Research
{vmishr23, ssaeidi1, chitta}@asu.edu, {jasriniv, gaoliu, apayani}@cisco.com
∗Equal Contribution
Abstract
Recent advances in reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like $\tau$-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool-calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent for improvement in agent decision-making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
1 Introduction
Recent advancements in Large Language Models (LLMs) (Annepaka and Pakray, 2025) have created the potential for them to be used as autonomous agents in complex real-world tasks like travel-booking, customer-support, and enterprise operations (Chen et al., 2024a; Wang et al., 2024; Singh et al., 2024; Yang et al., 2024). However, such complex tasks require reasoning and planning capabilities beyond language processing alone: agents must be able to invoke suitable tools (the terms 'tool-use', 'tool-calling', and 'function-calling' are used interchangeably in this paper), which complete tasks through logic implemented in computer programs, leading to deterministic outcomes. Recent research (Yao et al., 2024; Lu et al., 2024; Yan et al., 2024), which benchmarks simulations of such real-world problem-solving settings, shows that LLM agents falter significantly in solving these tasks, committing errors that range from generative hallucinations to failures to adhere to context and violations of domain-specific policies caused by incorrect reasoning about actions over extended interactions.
These shortcomings underscore the need for more fine-grained evaluations and methods that can diagnose and address the nuanced failure modes of LLM agents in complex, real-world interactions that employ natural language as a form of communication. Thus, our main focus in this work is to identify and mitigate the reasons language agents fail to solve simulations of real-world conversational requests that require complex reasoning and relevant information processing according to the situation at hand. To this end, we utilize $\tau$-bench (Yao et al., 2024) as an appropriate test-bed for such investigation as it emulates realistic airline and retail dialogues. We define the reasoning about actions of language agents as the ability to generate context-aware inference and decision-making tokens for selecting the next best action (a tool-call in this context). Additionally, we define and evaluate the planning capabilities of the agents through decision-making for tool-calling over multiple tool-calls in the correct sequential manner to complete a goal.
Inspired by recent work in context engineering (Mei et al., 2025), we propose a three-pronged sequential approach. First, we develop a comprehensive error classification that categorizes common reasoning and planning mistakes in a multi-turn tool-calling simulation. This taxonomy serves as a diagnostic guideline to systematically identify and understand the causes of failures for LLM agents. Second, we manually experiment with input reformulations of the user requests to evaluate whether the correct prompt reformulations can guide the tool-calling agents towards correct decision-making through appropriate tool-calling/response to the user. Third, we automate this prompt-reformulation process by building a multi-agent LLM framework (§5.2), called Input-Reformulation Multi-Agent (IRMA), which further optimizes the input reformulation with augmentation of follow-up questions (§5.1). Before the tool-calling agent invokes or responds to any tool output, our automated framework supplies targeted guidance that ensures strict adherence to domain-specific rules and well-placed follow-up questions to extract accurate information, thereby enhancing its reasoning and planning capabilities in dynamic environments.
Our results show that the IRMA framework not only outperforms ReAct (Yao et al., 2023), Function Calling, and Self-Reflection (Renze and Guven, 2024) on pass@1, but also achieves 20% and 22.4% higher accuracy on Airline tasks compared to Gemini 1.5 Pro-FC and Claude 3.5 Haiku-FC, respectively. IRMA also demonstrates stronger reliability, with higher scores on pass^4 and pass^5 (Figure 4). In addition, IRMA solves tasks in fewer turns than competing methods, highlighting its efficiency (Figure 6). Lastly, IRMA shows greater robustness, with an increased performance gap on pass^5 after removing tasks affected by ground truth and instruction errors in the airline and retail domains.
The main contributions of our work are:
1. Fine-grained causal-centric error classification of failure modes occurring in a multi-turn tool-use conversational benchmark.
2. We propose the Input-Reformulation Multi-Agent Framework (IRMA), a verification-loop-free approach that improves function-calling agents by reformulating prompts with structured and contextually relevant information. IRMA guides the agent to better follow domain policies by enriching its input with key constraints and tool-related context.
3. We perform an in-depth evaluation of IRMA’s performance across reliability, consistency, and accuracy. Furthermore, our analysis of efficiency reveals that IRMA is able to solve tasks using fewer interaction turns than competing methods.
2 Related Works
Tool-Integration for LLMs
The ReAct framework, introduced by Yao et al. (2023), is one of the first approaches to explore the potential of Large Language Models (LLMs) as tool-using agents by integrating reasoning and acting within LLMs. Toolformer (Schick et al., 2023) presents a fine-tuning approach to teach LLMs to invoke tool calls. ToolEVO (Chen et al., 2024b) and ToolLLM (Qin et al., 2023) employ tree search algorithms for integrating and evaluating tool-learning capabilities in LLMs. ToolACE (Liu et al., 2024b), AutoTools (Shi et al., 2025), and APIGen (Liu et al., 2024c) introduce automated frameworks designed to generate accurate, complex, and high-quality tool-learning data, with works like (Prabhakar et al., 2025; Yin et al., 2025) extending this to multi-turn interactive conversational settings.
Tool-Use Benchmarks
LLMs have been extensively evaluated on invoking external functions in both single-turn and interactive multi-turn conversational test beds. API-Bench (Patil et al., 2024) and API-Bank (Li et al., 2023) are two prominent benchmarks designed to evaluate the function-calling capabilities of LLMs in single-turn scenarios. NESTful (Basu et al., 2024) focuses on evaluating LLMs’ ability to handle nested sequences of API calls. ToolQA-D (Chen et al., 2024b) gauges robustness in changing API specifications. $\tau$-bench (Yao et al., 2024) and ToolSandbox (Lu et al., 2024) emulate realistic dialogues requiring policy-compliant tool use over multi-turn user-agent interactions, where each step modifies an external environment. While these existing multi-turn benchmarks evaluate the overall success of tool-calling agents, they lack fine-grained analysis of reasoning errors while following complex domain rules, a gap our work addresses through the construction of a fine-grained error classification by evaluating $\tau$-bench.
Improving LLM Tool-Use
Recent research has explored diverse strategies to enhance the tool-use capabilities of LLMs, focusing on API calling and web-environment interaction, by leveraging techniques such as synthetic data generation, reinforcement learning, and memory augmentation. Liu et al. (2024c) introduce APIGen, an automated pipeline that generates high-quality, verifiable single-turn function-calling datasets, enabling small models to outperform GPT-4 on the BFCL (Patil et al., 2025). APIGen-MT (Prabhakar et al., 2025) extends the framework to show improvement in models on multi-turn scenarios through blueprint-driven simulation of human–agent dialogues. ReTool (Feng et al., 2025) integrates dynamic code execution within the reasoning process and training via outcome-driven RL, which significantly improves multi-step reasoning. Nemotron-Tool-N1 (Zhang et al., 2025) uses an RL framework to teach precise tool invocation and explicit reasoning, achieving state-of-the-art on API-Bank (Li et al., 2023) and BFCL. ARTIST (Singh et al., 2025) integrates agentic reasoning with RL, enabling LLMs to decide autonomously when and how to call tools. Memento (Zhou et al., 2025) employs a memory-augmented, case-based planner for continual adaptation without retraining, achieving strong generalization on GAIA (Mialon et al., 2023) and DeepResearcher (Zheng et al., 2025) benchmarks. While these works mark a shift toward adaptive, planning-driven, and memory-augmented LLM agents by leveraging training methods, our proposed IRMA framework explores tool-use improvement from the perspective of context engineering (Mei et al., 2025) principles.
3 Problem Statement
To evaluate the tool-usage capabilities of current Large Language Models (LLMs), we adopt the benchmark provided by $\tau$-bench (Yao et al., 2024). This benchmark is specifically designed to assess language agents in realistic, multi-turn interaction settings. $\tau$-bench includes tasks from two domains: (1) Airline, comprising 50 tasks centered around flight reservation scenarios, and (2) Retail, containing 115 tasks focused on shopping and order management. In this setup, both the user and the customer-service assistant are simulated by LLMs, enabling a controlled environment for analyzing interactive behavior. The customer-service agent is the language agent that generates the tokens signifying which tools are to be invoked, while following the specific domain policies (see Appendix C).
Each task is framed as a Partially Observable Markov Decision Process (POMDP) (details in Appendix A), where the assistant agent must generate appropriate function calls based on user inputs. These function calls are executed in an external environment, which then returns outputs that shape the ongoing dialogue. The interaction continues until the user ends the conversation, and the performance of the assistant is evaluated based on final rewards. These rewards reflect how closely the agent’s actions align with gold-standard trajectories and how well it fulfills the user’s goals.
A key challenge in $\tau$-bench arises from the dynamic nature of user-agent interactions, where both user inputs and agent responses can vary across runs. This variability requires the agent to consistently execute correct action sequences, regardless of the conversational path. However, current results indicate that even state-of-the-art LLMs struggle to reliably complete these tasks as the number of trials increases. To address this limitation, we conduct a root-cause analysis of common agent errors (§4) and introduce IRMA, a multi-agent framework (§5) designed to improve agent reliability in this challenging setting.
4 Error-Classification
To identify the failure modes of LLMs, human evaluators conducted experiments using GPT-4o (Hurst et al., 2024) as the base model for both the user and the assistant agent across all tasks in $\tau$-bench (Yao et al., 2024). Both ReAct and function-calling agent configurations were used to generate up to five trials per task in each domain. Evaluators manually reviewed the resulting multi-turn conversation trajectories from the retail and airline domains. While prior studies (Sun et al., 2024; Winston and Just, 2025; Cemri et al., 2025) have examined failures related to tool availability, definition errors, or tool set complexity, our analysis focuses specifically on the contextual reasoning limitations of LLMs in generating tool calls within dynamic, multi-turn interactions.
Although $\tau$-bench provides a general taxonomy of failure types for the retail domain, our classification is more cause-oriented than effect-oriented. By framing errors in terms of their underlying causes, we can more effectively inform the design of targeted interventions, such as retrieval-augmented memory to mitigate context retention issues or follow-up question generation (§5.1) to reduce hallucinations from context drift. The following subsections (§4.1–§4.4) provide a detailed breakdown of the identified error types.
4.1 User Instruction Hallucination
User instruction errors occur when the LLM-simulated user deviates from the original task instruction, typically in the later stages of a conversation. These errors highlight the limitations of LLMs in maintaining instruction fidelity over long contexts, especially when multiple follow-up turns introduce competing directives. Another contributing factor is context drift, where the model increasingly relies on recent inputs or high-probability continuations, leading it to overlook or forget the initial user intent. An example illustrating this error is provided in Figure 8 in Appendix D.
4.2 Agent Hallucination
Agent hallucination errors arise when the assistant agent generates incorrect or incomplete responses that fail to fully satisfy the user’s request. For example, the agent may neglect to process all items specified by the user or incorrectly fulfill a request by selecting the wrong item or applying it to the wrong order. These errors reflect underlying challenges with LLM memory limitations (Shan et al., 2025) and the degradation of instruction-following abilities over long contexts (Liu et al., 2023). As prior context accumulates, excessive or outdated information can distort the model’s understanding, leading to hallucinated outputs and ultimately incorrect decisions (Zhang et al., 2024).
4.3 Domain Policy Violation
Domain policy violations occur when tool-calling agents make decisions that contradict the domain-specific constraints defined for task completion. For instance, in Retail task 19 (Figure 9), the agent attempts to exchange the user’s office chair and pet bed even though the order is no longer in ‘delivered’ status, a prerequisite that the domain rules require for an exchange. The agent thereby violates the domain rule (see Figure 14): “An order can only be exchanged if its status is ‘delivered’…” Such violations may also arise when the user issues an invalid request, and the agent proceeds to fulfill it without adhering to the applicable domain rules. This error arises for reasons similar to those described in §4.1 and §4.2.
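To make the quoted rule concrete, the following is a minimal sketch of a pre-call policy check. It is an illustration only and not part of the benchmark or of IRMA: the `order` dictionary shape, the function names `can_exchange` and `guarded_exchange`, and the refusal message are all assumptions introduced here. Only the rule itself (an exchange requires `delivered` status) comes from the text.

```python
from typing import Callable, Dict


def can_exchange(order: Dict[str, str]) -> bool:
    """Return True only if the order satisfies the quoted domain rule.

    Assumption: an order is a dict with a 'status' field.
    """
    return order.get("status") == "delivered"


def guarded_exchange(order: Dict[str, str],
                     exchange_tool: Callable[[Dict[str, str]], str]) -> str:
    """Call the (hypothetical) exchange tool only when the rule is met."""
    if not can_exchange(order):
        # Refuse instead of issuing a policy-violating tool call.
        return "refused: order status is not 'delivered'"
    return exchange_tool(order)


def fake_exchange_tool(order: Dict[str, str]) -> str:
    """Stand-in for a real tool; it only reports that it was called."""
    return "exchange requested for " + order["order_id"]


delivered = {"order_id": "A1", "status": "delivered"}
pending = {"order_id": "B2", "status": "pending"}
result_ok = guarded_exchange(delivered, fake_exchange_tool)
result_blocked = guarded_exchange(pending, fake_exchange_tool)
```

In the failure described above, the agent effectively skips the `can_exchange` step and issues the tool call regardless of the order status.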
4.4 Contextual Misinterpretation
Contextual misinterpretation errors occur when the tool-calling agent misunderstands the intent or nuance of the user’s request and generates function calls using inappropriate tools for the given context. For example, if a user asks to return an item and receive a different one in exchange, a human familiar with the domain policies would recognize this as an exchange request. However, the LLM-based agent may misinterpret it as a simple return, failing to grasp the full context and thereby invoking the wrong tool.
5 Method
As outlined in the previous sections, complex dynamic environments such as $\tau$-bench present reliability challenges. Specifically, the user simulator may hallucinate during interactions, generating questions that do not adhere to the provided instructions. In this study, we aim to improve the assistant agent’s tool-calling performance in $\tau$-bench by enabling more accurate decision-making. Unlike prior approaches that monitor and correct agent actions through verification or reflection, our method focuses on enhancing the quality of the agent’s input before any action is taken. To achieve this, we first introduce a novel prompting strategy: Follow-up Question Acting (FACT), designed to support decision-making in dynamic settings. We then present the Input Reformulation Multi-Agent (IRMA) framework that reformulates the agent’s input to guide more effective and context-aware decisions.
5.1 FACT: Follow-up question ACTing
Although reasoning-based prompting techniques like ReAct outperform non-reasoning methods such as Act, they remain inefficient in dynamic environments. As shown in Figure 3, ReAct often calls a tool prematurely, triggers an error, and only then asks clarifying questions, leading to longer conversations and increased interaction issues. To overcome this, we introduce Follow-up Question ACTing (FACT), a prompting method that first gathers information through targeted questions before calling a tool. Our results in Figure 6 show that FACT is more effective than ReAct and performs comparably to Function Calling. We refer readers to Appendix §E.1 for details.
Another advantage of FACT is its ability to involve the user in the loop. When the user simulator hallucinates or provides misleading input, FACT detects the issue and hands off the conversation to a human, ensuring more robust handling of unreliable inputs. In summary, FACT is more efficient, reliable, and consistent than other methods in dynamic environments. However, in long conversations, it may forget domain rules and tools due to system prompt limitations, leading to domain violations. To address this, we propose the Input-Reformulation Multi-Agent Framework (IRMA), which restructures the user prompt to retain key information like domain rules and a relevant tool list within the assistant’s input.
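The ask-before-acting behavior that FACT encourages can be pictured with a small sketch. FACT itself is a prompting method, so the code below is only an analogy written under stated assumptions: the function `next_action`, the action dictionary format, and the argument names `origin`, `destination`, and `date` are hypothetical. Only the tool name `search_direct_flight` appears in the paper.

```python
from typing import Dict, List


def next_action(tool_name: str,
                required_args: List[str],
                known_args: Dict[str, str]) -> Dict[str, object]:
    """Decide between asking a follow-up question and calling a tool.

    A tool call is emitted only when every required argument is known;
    otherwise a targeted follow-up question is returned first.
    """
    missing = [arg for arg in required_args if arg not in known_args]
    if missing:
        question = "Could you provide: " + ", ".join(missing) + "?"
        return {"type": "ask", "question": question}
    arguments = {arg: known_args[arg] for arg in required_args}
    return {"type": "call", "tool": tool_name, "arguments": arguments}


# Hypothetical argument names for the tool mentioned in the paper.
required = ["origin", "destination", "date"]
first = next_action("search_direct_flight", required, {"origin": "PHX"})
second = next_action(
    "search_direct_flight",
    required,
    {"origin": "PHX", "destination": "JFK", "date": "2024-05-20"},
)
```

Here `first` is a follow-up question listing the missing fields and `second` is a tool call, which mirrors the contrast drawn with ReAct's premature calls.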
5.2 IRMA: Input-Reformulation Multi-Agent Framework
Our analysis reveals three key failure cases for assistant agents. First, in long conversations, the agent may forget the user’s initial request and respond only partially. Second, it may violate domain rules by forgetting constraints from lengthy policy lists. Third, tool selection becomes harder over time, especially when tools have similar names (e.g., "search_direct_flight" vs. "search_onestep_flight"), leading to incorrect choices.
We hypothesize that combining user queries with crucial context, such as domain rules and relevant tools, can improve the assistant agent’s decision-making. To test this, we conducted a human-in-the-loop experiment with prompt engineers who reformulated queries using additional policy and tool information. In most cases, the agent successfully completed the tasks, motivating us to automate this input reformulation process.
Based on this insight, we propose the Input Reformulation Multi-Agent Framework (IRMA). In contrast to prior methods that focus on post-hoc correction of the agent’s behavior, such as Self-Reflection, PlanGen (Parmar et al., 2025), or other verification-based approaches, IRMA centers on enhancing the quality of the input provided to the assistant agent. This approach enhances decision-making at the input stage, before any action is taken, ensuring more accurate and context-aware responses. The framework comprises three core modules: memorization, constraints, and tool suggestion.
Memorization
This module is independent of the language model and is responsible for storing the user queries throughout the interaction trajectory. It helps the agent retain awareness of the initial request and make decisions accordingly. The conversation history is maintained within <memory> tags.
Constraints
One of the main reasons the agent makes incorrect decisions is domain policy violation. A key insight from the human-in-the-loop experiment was the positive impact of providing a concise list of domain constraints to guide the assistant agent’s decisions. To address this challenge, we define a dedicated agent that generates a checklist of relevant domain constraints based on the user query. If the user query is a response to a follow-up question from the assistant, the agent is prompted to return “None”. The generated constraint list is stored within <constraints> tags to ensure the assistant agent receives a structured and interpretable input prompt.
Tool Suggestion
Although the number of available tools is limited, the assistant agent sometimes struggles to select the most relevant tool for a given user query. In some cases, after encountering an error or receiving an empty output, the agent may lose track of other parts of the user’s request. To mitigate this, we introduce a Tool Selector agent that generates a short list of tools most relevant to the user query, along with a one-line explanation for each suggestion. This list is stored within <tool_suggested> tags to help the assistant agent focus on selecting the most appropriate tool.
In summary, the IRMA framework aims to replicate the input reframing performed by researchers during the human-in-the-loop experiment. Unlike other techniques such as verification, self-reflection, or agentic verification methods, IRMA functions in a loop-free manner and focuses on strengthening the input by reformulating the user query. This approach not only improves accuracy but also offers better cost-effectiveness compared to alternative methods. In the next section, we provide a comparative analysis of IRMA against existing techniques.
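The three modules can be sketched as a single loop-free assembly step. This is a minimal illustration under assumptions, not the authors' implementation: the class `InputReformulator`, its field names, the ordering of the tagged blocks, and the stub agents are hypothetical, and in practice the constraint and tool-suggestion agents would be LLM calls. The tag names `<memory>`, `<constraints>`, and `<tool_suggested>` and the "None" convention come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InputReformulator:
    # Hypothetical agents: each maps a user query to a text block.
    constraint_agent: Callable[[str], str]
    tool_agent: Callable[[str], str]
    # Memorization needs no language model: it just stores user queries.
    memory: List[str] = field(default_factory=list)

    def reformulate(self, user_query: str) -> str:
        """Build the enriched input in one pass, with no verification loop."""
        self.memory.append(user_query)
        memory_block = "\n".join(self.memory)
        constraints = self.constraint_agent(user_query)
        tools = self.tool_agent(user_query)
        return (
            f"<memory>\n{memory_block}\n</memory>\n"
            f"<constraints>\n{constraints}\n</constraints>\n"
            f"<tool_suggested>\n{tools}\n</tool_suggested>\n"
            f"{user_query}"
        )


def stub_constraints(query: str) -> str:
    """Stand-in for the constraint agent; returns "None" for short replies."""
    if "exchange" in query:
        return "- An order can only be exchanged if its status is 'delivered'"
    return "None"


def stub_tools(query: str) -> str:
    """Stand-in for the Tool Selector; the one-line reason is made up."""
    return "search_direct_flight: hypothetical one-line explanation"


reformulator = InputReformulator(stub_constraints, stub_tools)
prompt_1 = reformulator.reformulate("I want to exchange my office chair")
prompt_2 = reformulator.reformulate("Yes, that is correct")
```

After the second turn, `prompt_2` still carries the initial request inside `<memory>`, while its `<constraints>` block holds "None" because the stub treats the turn as a reply to a follow-up question.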
6 Experiments
6.1 Experimental Setup
We present the baseline models and comparison methods, followed by an analysis of the IRMA framework using various evaluation metrics and ablation studies (see Appendix G) to assess the impact of its individual components on $\tau$-bench performance.
Models and Methods
We evaluated IRMA against a range of open-source and closed-source language models. The open-source models include Qwen2.5-32B (Qwen et al., 2025), LLaMA-3.1-70B (Grattafiori et al., 2024), DeepSeek-v3 (Liu et al., 2024a), and Phi-4-14B (Abdin et al., 2024), while the closed-source models comprise Claude 3.5 (Anthropic, 2024b), Gemini 1.5 (Team et al., 2024), and GPT-4o (Hurst et al., 2024). In addition, we compared IRMA with three widely adopted prompting strategies: (1) ReAct, a reasoning-based prompting technique; (2) Function Calling, designed specifically to enhance a model’s tool-calling capability; and (3) Self-Reflection, a method aimed at improving tool-use performance by addressing errors in the agent’s actions.
Evaluation
To evaluate performance, we use the pass^k metrics, which measure the reliability and consistency of models across different prompting strategies. The pass^k metric (pronounced "pass hat k") is defined as the probability that all of the $k$ independently sampled outputs successfully complete the task, averaged across all tasks. Specifically, if a task is run for $n$ independent trials and $c$ of those are successful (i.e., have a correct result with reward $r = 1$), an unbiased estimate of pass^k can be computed using the following formula:
$$\text{pass}^k = \mathbb{E}_{\text{task}}\left[ \binom{c}{k} \Big/ \binom{n}{k} \right]$$
This metric indicates how likely a model is to succeed on every one of $k$ attempts, capturing the reliability and consistency of its outputs.
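As a worked illustration, the estimator can be computed directly from per-task success counts: for each task, take the ratio of binomial coefficients C(c, k) / C(n, k), then average over tasks. The sketch below assumes every task is run for the same number of trials `n`; the function name `pass_hat_k` and the example counts are made up for illustration and are not results from the paper.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], n: int, k: int) -> float:
    """Estimate pass^k as the mean over tasks of C(c, k) / C(n, k).

    successes_per_task holds c, the number of successful trials per task.
    n is the number of trials per task, and k must satisfy 1 <= k <= n.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if len(successes_per_task) == 0:
        raise ValueError("at least one task is required")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n:
            raise ValueError("each success count must lie in [0, n]")
        # comb(c, k) is 0 when k > c, so such tasks contribute nothing.
        total += comb(c, k) / comb(n, k)
    return total / len(successes_per_task)


# Made-up example: three tasks, five trials each, with 5, 3, and 0 successes.
example_counts = [5, 3, 0]
curve = [pass_hat_k(example_counts, n=5, k=k) for k in range(1, 6)]
```

For `k = 1` the estimate reduces to the average success rate, and it can only stay flat or fall as `k` grows. A task solved in 3 of 5 trials contributes 0.6 at `k = 1` but nothing at `k = 4` or `k = 5`.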
6.2 Experimental Results
As outlined in $\tau$-bench, reliability and consistency are often more critical in real-world scenarios than the average success rate (measured by pass@1). We argue that an ideal agentic method should exhibit three key properties: (1) Accuracy, (2) Reliability, and (3) Consistency. Accordingly, we begin by comparing results using pass@1 to assess accuracy, and then evaluate the performance of state-of-the-art methods using pass^k to measure reliability and consistency.
IRMA outperforms other state-of-the-art methods in tool calling.
We conducted evaluations of multiple methods—Function Calling (FC), ReAct, and Self-Reflection—each executed over five trials. These experiments were performed using the GPT-4o model. The results, presented in Table 1, show that the IRMA framework outperforms ReAct, Self-Reflection, and FC by 6.1%, 3.9%, and 0.4%, respectively, in overall pass@1 score. Additionally, in the airline tasks, which represent the most challenging scenarios within the dynamic environment, IRMA on GPT-4o achieves improvements of 20%, 22.4%, and 9.2% compared to Gemini 1.5 Pro-FC, Claude 3.5 Haiku-FC, and Claude 3.5 Sonnet-FC, respectively. These findings highlight IRMA’s strong accuracy in real-world tasks and demonstrate its effectiveness over existing methods.
IRMA is more reliable and consistent than other methods in dynamic settings.
The results in Table 1 show that the performance of IRMA on retail pass^1 is slightly lower than that of GPT-4o-FC. For this reason, we further explored the performance of other methods using pass^k to evaluate their reliability and consistency. The results in Figure 6 show that IRMA, compared with ReAct and FC on GPT-4o, is much more reliable and consistent, outperforming ReAct and FC by 16.1% and 12.6%, respectively, in overall scores on pass^5.
IRMA is more robust on tasks with GT and UI errors.
As explained in the previous sections, $\tau$-bench suffers from two major issues: (1) Ground Truth (GT) errors and (2) User Instruction (UI) errors. Figure 5 shows the distribution of these errors across the airline and retail tasks. We progressively removed tasks affected by these problems, and the results revealed that the performance of all three methods improved, with IRMA showing slightly greater gains compared to the others. We hypothesize that IRMA is more robust to hallucination-related issues. Specifically, in tasks with GT errors, IRMA tends to avoid incorrect tool calls or invalid actions and instead produces safe and accurate responses.
A key observation is the change in performance difference between IRMA and FC on pass^5. Before removing tasks with GT and UI errors, IRMA outperformed FC by 10%. However, after removing these problematic tasks, the performance gap widened to 16.1% on average. Similar patterns were observed for other methods as well, reinforcing the claim that IRMA is more robust and less sensitive to noisy supervision and ambiguous instructions compared to existing techniques.
IRMA solves tasks more efficiently and effectively, using fewer turns than others.
One of the primary reasons assistant agents make incorrect decisions in the final turns is the length of the conversation, which often causes them to forget important rules and instructions. In an ideal scenario, an assistant should resolve the user’s query with the fewest but most effective actions. To investigate this aspect, we analyzed the distribution of turns in successful task completions by IRMA, ReAct, FC, and Self-Reflection, as shown in Figure 6. The results show that, in retail tasks, IRMA completes tasks with 7.9 fewer turns than Self-Reflection and 3.1 fewer than FC. In airline tasks, IRMA requires 8.3 fewer turns than Self-Reflection, 1.1 fewer than FC, and 3.3 fewer than ReAct. These results demonstrate IRMA’s superior efficiency compared to other state-of-the-art methods.
Input Reformulation framework vs Self-Reflection
The central concept of IRMA is to reformulate the agent’s input under the assumption that supplying sufficient and well-structured information enables the agent to act more reliably and consistently in real-world scenarios. To evaluate this, we implemented the Self-Reflection method (Appendix F), which analyzes the agent’s previous actions and extracts relevant information from domain rules to guide future decisions (see section E.1 for implementation details). As shown in Figure 4, IRMA outperforms Self-Reflection in both airline and retail tasks, achieving a 3.9% higher overall score in pass@1. More notably, IRMA exceeds Self-Reflection by 19.1% in pass^5, highlighting its superior reliability in a real-world environment.
In summary, while ReAct and Self-Reflection perform well in certain settings, they fall short in complex, dynamic environments like $\tau$-bench. Role-play methods, including verification techniques, are also inefficient, as real-world scenarios require assistant agents to act based on limited information, with each action affecting the environment. Although Function Calling was designed for tool use, our results show it lacks reliability in decision-making and offers limited controllability, even in GPT-4o with tailored system prompts. Combining FACT with GPT-4o-FC led to a 12% performance drop, highlighting the need for more robust approaches. In contrast, IRMA consistently delivers higher accuracy, reliability, and consistency in dynamic environments like $\tau$-bench.
7 Conclusion
In this work, we investigate the limitations of state-of-the-art LLM-based tool-calling agents in complex, multi-turn environments, focusing on the retail and airline domains of $\tau$-bench. Through a detailed analysis of conversation trajectories, we identify four major failure modes: user instruction hallucination, agent hallucination, domain-policy violations, and contextual misinterpretation, all of which stem from limitations in memory retention, contextual reasoning, and adherence to domain constraints across extended interactions. To address these challenges, we propose the Input Reformulation Multi-Agent (IRMA) framework, designed to enhance the structure of the assistant agent’s input. Our results show that IRMA not only outperforms other methods in pass^1 but also demonstrates significantly higher reliability, achieving an overall score of 43% pass^5 in $\tau$-bench. Moreover, by leveraging the FACT agent, IRMA exhibits greater efficiency in task completion. In conclusion, IRMA shows robust and consistent behavior in the unreliable and dynamic environment of $\tau$-bench, highlighting its effectiveness in real-world tool-use scenarios.
Limitations
Although the Input Reformulation Multi-Agent (IRMA) framework demonstrated superior performance on $\tau$-bench, several limitations remain. As shown in Figure 4, while IRMA exhibits greater reliability compared to other methods, its performance on pass^5 still hovers around 43%. This indicates that there is still considerable room for improving the reliability of tool-using agents in real-world scenarios. Another limitation of this work is that our experiments and analysis are restricted to the $\tau$-bench benchmark. It would be valuable to evaluate IRMA across a broader range of real-world environments to assess its generalizability.
Moreover, our observations suggest that beyond the error taxonomy we proposed, $\tau$-bench itself suffers from issues related to unfair reward modeling. Building a truly dynamic and reliable evaluation environment, especially one that can control for the correctness of user instructions, would have a significant impact on the field. Such an environment would enable more rigorous development and evaluation of agentic frameworks and encourage further research into robust, real-world agent behavior. Ultimately, we believe this work contributes meaningfully to the research community and provides a strong foundation for developing more reliable and consistent agentic methods for dynamic environments.
Ethics Statement
We have utilized AI assistants, specifically Grammarly and ChatGPT, to correct grammatical errors and rephrase sentences.
Acknowledgement
We thank the anonymous reviewers for their constructive suggestions. We extend our gratitude to Research Computing (RC) and Enterprise Technology at ASU for providing computing resources and access to the GPT API for experiments. This work was supported in part by a gift award from Cisco Research.
Appendix A Task Definition in $\tau$-bench
Following Yao et al. (2024), each task in $\tau$-bench is modelled as a partially observable Markov decision process (POMDP)
$$(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R}, \mathcal{U}).$$
We briefly restate every component and specify how they instantiate in the retail and airline domains.
State space $\mathcal{S}$:
The hidden state is factored into $s = (s_{\text{db}}, s_{\text{user}}) \in \mathcal{S}$, where $s_{\text{db}}$ is a snapshot of the underlying database (orders, flights, balances, etc.) and $s_{\text{user}}$ stores the latent user context (identity, revealed preferences, dialogue progress).
Action space $\mathcal{A}$:
The agent can either (i) invoke an API tool that queries or mutates the database ($\mathcal{A}_{\text{db}}$) or (ii) send a free-form respond message to the user ($\mathcal{A}_{\text{user}}$). Thus $\mathcal{A} = \mathcal{A}_{\text{db}} \cup \mathcal{A}_{\text{user}}$.
Observation space $\mathcal{O}$:
After each action the environment returns either a JSON payload/error from the database ($\mathcal{O}_{\text{db}}$) or the next user utterance produced by an LLM simulator ($\mathcal{O}_{\text{user}}$), yielding $\mathcal{O} = \mathcal{O}_{\text{db}} \cup \mathcal{O}_{\text{user}}$.
Transition function $\mathcal{T}$:
$\mathcal{T}$ is deterministic for database tools (state is updated, observation is the tool output) and stochastic for respond, which calls the user simulator to sample the next utterance and potentially reveal more of the instruction.
Reward function $\mathcal{R}$:
At dialogue termination we compare the execution log to a gold reference: (1) hashes of mutable tables must match, (2) all mandatory natural-language outputs must appear in the agent’s responses. If both hold, $r = 1$; otherwise $r = 0$.
Instruction space $\mathcal{U}$:
Each task provides a fixed natural-language instruction $u \in \mathcal{U}$ describing the user goal, persona, and constraints. The user simulator may disclose $u$ incrementally; therefore the agent must act under partial observability.
This causal decomposition lets us pinpoint failure modes such as wrong tool arguments (action-level), policy violations (transition-level), or hallucinated user messages (observation-level).
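The components above can be illustrated with a toy environment. This is a minimal sketch under stated assumptions, not the $\tau$-bench implementation: the class `ToyEnv`, the `cancel_order` tool, the scripted user, and the end-of-dialogue marker are all hypothetical. It shows the two action types (database tool call versus respond), the two observation types, and the binary reward that requires both a matching database hash and the presence of every required output in the agent's replies.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(obj):
    """Stable hash of a JSON-serialisable database snapshot."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


@dataclass
class ToyEnv:
    """Toy stand-in for one task; every name here is hypothetical."""
    db: dict                # mutable database snapshot
    script: list            # user utterances, revealed one per respond action
    gold_db: dict           # database state expected at termination
    required_outputs: list  # strings that must appear in the agent's replies
    replies: list = field(default_factory=list)

    def step(self, action):
        """Apply one action and return the resulting observation."""
        if action["type"] == "tool":
            # Database tools are deterministic: mutate state, return a payload.
            if action["name"] != "cancel_order":
                return {"error": "unknown tool"}
            order = self.db["orders"].get(action["order_id"])
            if order is None:
                return {"error": "order not found"}
            order["status"] = "cancelled"
            return {"ok": True, "order": dict(order)}
        # respond: record the reply and return the next user utterance.
        # A real simulator would sample it from an LLM (stochastic); this
        # sketch replays a fixed script instead.
        self.replies.append(action["content"])
        return self.script.pop(0) if self.script else "<end of dialogue>"

    def reward(self):
        """Return 1 only if the database hash matches and all outputs were said."""
        same_db = _digest(self.db) == _digest(self.gold_db)
        said_all = all(
            any(req in reply for reply in self.replies)
            for req in self.required_outputs
        )
        return 1 if same_db and said_all else 0


env = ToyEnv(
    db={"orders": {"W1": {"status": "pending"}}},
    script=["Yes, please go ahead."],
    gold_db={"orders": {"W1": {"status": "cancelled"}}},
    required_outputs=["cancelled"],
)
env.step({"type": "respond", "content": "Do you confirm cancelling order W1?"})
env.step({"type": "tool", "name": "cancel_order", "order_id": "W1"})
env.step({"type": "respond", "content": "Order W1 has been cancelled."})
print(env.reward())
```

Because the reward is all-or-nothing, a single wrong database write or a missing mandatory statement yields zero for the whole dialogue, which is what makes long-horizon consistency matter.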
Appendix B Pass^k Results
B.1 Airline results
Tables 2-4 report the pass^k results of the baselines and our implemented methods in the Airline domain. As explained in §6.2, IRMA performs better when there are no ground-truth or user instruction errors. All results are obtained using GPT-4o as the LLM in the agent frameworks.
B.2 Retail results
Tables 5-7 report the pass^k results of the baselines and our implemented methods in the Retail domain.
Appendix C Domain Policies
Figures 13 and 15 show the domain policies for the retail and airline domains of $\tau$-bench. These rules are injected verbatim as the system prompt to every tool-calling agent. An agent that violates any of them, even if it successfully fulfills the user’s request, receives zero reward. The tool-calling agent must therefore operate strictly within these policies when solving user requests.
Appendix D Failure Example
Figures 8 and 9 show examples of the errors enumerated in the subsections of §4, taken from a conversation trajectory for Task 19 (retail). Error 1 in Figure 8 is an instance of ‘User Instruction Hallucination’ occurring in the very first user turn. Error 2 in Figure 9 is an instance of ‘Domain Policy Violation’. The user instruction for Task 19 is provided in Figure 7. This ‘instruction’ is the original user instruction provided to the LLM-simulated user. It is the ‘script’ the user has to follow when making requests to the agent.
Appendix E Input Reformulation Multi-Agent Framework
E.1 Follow-up question ACTing (FACT) Agent
The primary difference between FACT and other prompting techniques lies in the instruction section of the system prompt (refer to Figure 10).
Appendix F Self-Reflection Framework
To check the effectiveness of self-reflection as an alternative against the baselines and IRMA, we implement a multi-agent LLM self-reflection pipeline, consisting of a retriever LLM agent and a verifier LLM agent. Contrary to input reformulation, where the prompt provided in the user query is reformulated, the self-reflection agent pauses the tool-calling LLM agent before the execution environment executes the tool-call. All of the previous user queries are provided as input to the retriever agent to extract the relevant domain policy rules based on the user intent reflected from the user requests in the conversation. The retrieved rules are provided to the verifier agent along with the tool-calling agent’s planned tool call. The verifier agent then verifies whether the tool-call is correct by providing a reflective justification based on determining whether any domain rule has been violated or not. The overall pipeline of the self-reflection agent is provided in Figure 12. The reflective feedback loop from verifier is set to be a one-time loop as the execution of the loop is very latency-heavy and invoking it multiple times might not be ideal in real-world customer-agent scenarios.
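The control flow described above can be sketched as follows. This is an illustrative outline only: `reflect_once` and the three stand-in functions (`retrieve`, `verify`, `replan`) are hypothetical names, and simple string matching replaces what would be LLM calls in the actual pipeline. The sketch shows the planned tool call being intercepted before execution, the retriever selecting rules from the user queries, the verifier returning a justification, and the feedback loop running at most once.

```python
from typing import Callable, List, Tuple


def reflect_once(
    user_queries: List[str],
    planned_call: dict,
    policy_rules: List[str],
    retrieve: Callable[[List[str], List[str]], List[str]],
    verify: Callable[[List[str], dict], Tuple[bool, str]],
    replan: Callable[[dict, str], dict],
) -> dict:
    """Gate one planned tool call with a single retrieve -> verify pass."""
    relevant = retrieve(user_queries, policy_rules)     # retriever agent
    ok, justification = verify(relevant, planned_call)  # verifier agent
    if ok:
        return planned_call
    # One-time feedback loop: the revised action is not verified again.
    return replan(planned_call, justification)


# Hypothetical stand-ins for the three LLM calls.
def retrieve(queries: List[str], rules: List[str]) -> List[str]:
    """Keep rules whose topic tag (text before ':') occurs in the conversation."""
    text = " ".join(queries).lower()
    return [r for r in rules if r.split(":")[0] in text]


def verify(rules: List[str], call: dict) -> Tuple[bool, str]:
    """Reject cancelling a non-pending order when a cancel rule was retrieved."""
    for rule in rules:
        if (
            rule.startswith("cancel")
            and call.get("name") == "cancel_order"
            and call.get("order_status") != "pending"
        ):
            return False, f"violates rule '{rule}'"
    return True, "no retrieved rule is violated"


def replan(call: dict, feedback: str) -> dict:
    """Fall back to a plain reply that explains the refusal."""
    return {"name": "respond", "content": f"I cannot do that: {feedback}"}


rules = [
    "cancel: only pending orders can be cancelled",
    "exchange: items must keep the same product type",
]
queries = ["Please cancel my order W1."]
blocked = reflect_once(
    queries, {"name": "cancel_order", "order_status": "delivered"},
    rules, retrieve, verify, replan,
)
allowed = reflect_once(
    queries, {"name": "cancel_order", "order_status": "pending"},
    rules, retrieve, verify, replan,
)
print(blocked["name"], allowed["name"])
```

Each gated tool call costs two extra model invocations (retriever and verifier) plus a possible replan, which is consistent with the latency concern that motivates limiting the loop to one pass.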
Appendix G Ablation Studies on IRMA
We ablate the three IRMA modules, Memory (M), Constraint (C), and Tool (T), and evaluate them on the airline subset. Across all pass^k metrics, the full configuration (M+C+T) achieves the best performance, indicating strong complementarity among modules. Among the ablated variants, M+C is consistently the strongest and ranks second overall, remaining more reliable than the other ablations at higher values of k. This pattern suggests that instruction retention (M) and policy/constraint adherence (C) account for most gains in long-horizon reasoning and plan stability, while tool disambiguation (T) provides the additional improvement needed to reach state-of-the-art performance. In sum, each module targets a distinct failure mode: carryover of instructions (M), rule compliance (C), and tool selection/parameterization (T). Their integration, however, is necessary for robust behavior in dynamic tool-use settings. The results of the ablation experiments are provided in Table 8.
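To illustrate how the three modules could combine into one reformulated input, here is a minimal sketch. It is not the paper's implementation: the function `reformulate_input`, the section titles, the rule texts, and the tool names are hypothetical, and plain keyword matching stands in for whatever each module actually does internally. It only shows the shape of the idea, in which the latest user query is augmented with retained user turns (M), relevant domain rules (C), and suggested tools (T).

```python
from typing import Dict, List


def reformulate_input(
    user_turns: List[str],
    domain_rules: Dict[str, str],
    tools: Dict[str, str],
) -> str:
    """Build one structured prompt from memory, constraints, and tool hints."""
    latest = user_turns[-1]
    conversation = " ".join(user_turns).lower()
    # Memory (M): carry every user turn forward verbatim.
    memory = [f"- {turn}" for turn in user_turns]
    # Constraint (C): keep rules whose keyword appears anywhere in the dialogue.
    constraints = [
        f"- {rule}" for key, rule in domain_rules.items() if key in conversation
    ]
    # Tool (T): suggest tools whose keyword appears in the latest turn.
    suggestions = [
        f"- {name}" for key, name in tools.items() if key in latest.lower()
    ]
    sections = [
        ("User query", [latest]),
        ("Memory", memory),
        ("Relevant domain rules", constraints or ["- none found"]),
        ("Suggested tools", suggestions or ["- none found"]),
    ]
    return "\n\n".join(
        f"{title}:\n" + "\n".join(lines) for title, lines in sections
    )


prompt = reformulate_input(
    ["I want to cancel order W1.", "Also, could I exchange the lamp?"],
    {
        "cancel": "Only pending orders can be cancelled.",
        "refund": "Refunds go to the original payment method.",
    },
    {"exchange": "exchange_items", "cancel": "cancel_order"},
)
print(prompt)
```

In this toy run the cancellation rule is still surfaced on the second turn even though the user has moved on to an exchange, which is the kind of instruction carryover the Memory and Constraint modules are meant to provide.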
We also evaluate IRMA with GPT-4o-mini as the LLM backbone to measure its effect on a smaller model. As shown in Table 9, the benefits of IRMA are not tied to a larger model’s reasoning strength and transfer to smaller function-calling backbones. Conceptually, IRMA does not replace parametric reasoning; rather, its structured inputs (memory, constraints, and tool suggestions) amplify a model’s ability to retain instructions, follow domain rules, and disambiguate tools across long contexts, yielding more stable performance under multiple attempts.
To isolate the effect of follow-up questioning, we create a controlled variant that swaps IRMA’s system prompt with the standard ReAct prompt while keeping all information-consolidating components—memory, constraint extraction, and tool suggestions—unchanged. This “IRMA + ReAct-prompt” baseline places both agents on identical inputs and differs only in the instructions provided to the final tool-calling agent. As reported in Table 10, IRMA consistently outperforms this baseline across Pass^k metrics, indicating that targeted follow-up questioning provides gains beyond ReAct.
References
- Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905.
- Annepaka and Pakray (2025) Yashwanth Annepaka and Prasenjit Pakray. 2025. Large language models: a survey of their development, capabilities, and applications. Knowledge and Information Systems, 67:2967–3022.
- Anthropic (2024a) Anthropic. 2024a. Claude 3.5 models and computer use. https://www.anthropic.com/news/3-5-models-and-computer-use. Accessed: 2025-05-20.
- Anthropic (2024b) Anthropic. 2024b. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet.
- Basu et al. (2024) Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, et al. 2024. Nestful: A benchmark for evaluating llms on nested sequences of api calls. arXiv preprint arXiv:2409.03797.
- Cemri et al. (2025) Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657.
- Chen et al. (2024a) Aili Chen, Xuyang Ge, Ziquan Fu, Yanghua Xiao, and Jiangjie Chen. 2024a. Travelagent: An ai assistant for personalized travel planning. arXiv preprint arXiv:2409.08069.
- Chen et al. (2024b) Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Yesai Wu, Yankai Lin, Wenzheng Feng, and Yasheng Wang. 2024b. Learning evolving tools for large language models. arXiv preprint arXiv:2410.06617.
