$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao; Noah Shinn; Pedram Razavi; Karthik Narasimhan

arXiv:2406.12045 · cs.AI · June 19, 2024 · 5 cites

TL;DR

This paper introduces $\tau$-bench, a benchmark for evaluating language agents' ability to interact with users and follow domain-specific rules, highlighting current limitations in consistency and reliability.

Contribution

The paper presents $\tau$-bench, a new benchmark with an evaluation process and metric for assessing language agents' interaction and rule-following in real-world domains.

Findings

01

Even state-of-the-art function-calling agents (such as gpt-4o) succeed on fewer than 50% of tasks.

02

Agents are highly inconsistent across repeated trials: pass^8 is below 25% in the retail domain.

03

Current methods need improvement for reliable, rule-compliant agent behavior.
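The pass^k figure in the findings above can be illustrated with a small sketch. This page does not give the formula, so the code below assumes one reading of the metric: pass^k is the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n recorded trials and then averaged over tasks. The function name `pass_hat_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k under the assumption stated above.

    successes_per_task: number of successful trials for each task (hypothetical data)
    n_trials: trials recorded per task
    k: number of trials that must all succeed
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical success counts for three tasks, eight trials each.
example_counts = [8, 4, 0]
for k in (1, 2, 8):
    print(k, pass_hat_k(example_counts, n_trials=8, k=k))
```

Under this reading, pass^1 is the ordinary average success rate, and the value can only stay flat or fall as k grows, which is why a gap between pass^1 and pass^8 signals inconsistency rather than low average ability.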

Abstract

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real-world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function-calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of…
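The abstract describes scoring a conversation by comparing the final database state with an annotated goal state. A minimal sketch of that idea follows; it is not the benchmark's actual implementation. It assumes the state is JSON-serializable, and the names `state_digest` and `task_reward` as well as the order records are hypothetical.

```python
import hashlib
import json


def state_digest(db_state):
    """Hash a JSON-serializable state in a key-order-independent way."""
    # sort_keys makes dictionaries with the same content serialize identically.
    canonical = json.dumps(db_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def task_reward(final_state, goal_state):
    """Return 1 if the end-of-conversation state equals the goal state, else 0."""
    return 1 if state_digest(final_state) == state_digest(goal_state) else 0


# Hypothetical retail-style records, for illustration only.
goal = {"orders": {"A1": {"status": "cancelled", "refund": 20}}}
same_content = {"orders": {"A1": {"refund": 20, "status": "cancelled"}}}
wrong_status = {"orders": {"A1": {"status": "pending", "refund": 20}}}

print(task_reward(same_content, goal))
print(task_reward(wrong_status, goal))
```

Comparing outcomes rather than transcripts means any sequence of tool calls that reaches the goal state counts as success. Note that lists inside the state are compared in order in this sketch, so a real schema might need extra normalization.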

Code & Models

Repositories

sierra-research/tau-bench (Official)

Videos

The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think · YouTube

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis