$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan

TL;DR
This paper introduces $\tau$-bench, a benchmark for evaluating language agents' ability to interact with users and follow domain-specific rules, highlighting current limitations in consistency and reliability.
Contribution
The paper presents $\tau$-bench, a new benchmark with an evaluation process and metric for assessing language agents' interaction and rule-following in real-world domains.
Findings
State-of-the-art function calling agents (like gpt-4o) succeed on fewer than 50% of tasks.
Agents are quite inconsistent across repeated trials, with pass^8 below 25% in the retail domain.
Current methods need improvement for reliable, rule-compliant agent behavior.
Abstract
Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real-world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of…
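The abstract names the pass^k metric without spelling out how it is computed. The sketch below is one way to estimate it, assuming pass^k means the probability that all k independent trials of a task succeed, averaged over tasks, and assuming the standard unbiased estimator C(c, k) / C(n, k) for a task with c successes out of n trials. The function name `pass_hat_k` and the sample success counts are hypothetical and are not taken from the benchmark's code or results.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k from per-task success counts.

    successes_per_task: number of successful trials (c) for each task
    n: number of trials run per task
    k: number of trials that must ALL succeed

    Assumption: each task contributes C(c, k) / C(n, k), the fraction of
    k-sized subsets of its n trials in which every trial succeeded.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n:
            raise ValueError("success count must be between 0 and n")
        # comb(c, k) is 0 when c < k, so unreliable tasks contribute nothing.
        total += comb(c, k) / comb(n, k)
    return total / len(successes_per_task)


# Hypothetical data: three tasks, eight trials each.
example_counts = [8, 4, 0]
average_success = pass_hat_k(example_counts, n=8, k=1)
all_eight_succeed = pass_hat_k(example_counts, n=8, k=8)
```

With k = 1 the estimate reduces to the average success rate. As k grows, only tasks that succeed on nearly every trial still contribute, which is why an agent with a moderate average success rate can have a much lower pass^8.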
Videos
The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think · YouTube
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis
