$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres; Honghua Dong; Soham Ray; Xujie Si; Karthik Narasimhan

arXiv:2506.07982·cs.AI·June 10, 2025

TL;DR

This paper introduces $\tau^2$-bench, a new benchmark for evaluating conversational agents in a dual-control environment where both agent and user actively modify a shared world, addressing limitations of existing single-control benchmarks.

Contribution

The paper presents a novel dual-control domain modeled as a Dec-POMDP, a compositional task generator, a reliable user simulator, and detailed performance analysis methods for conversational agents.
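To make the idea of a compositional task generator concrete, here is a minimal Python sketch. It is not the paper's generator: the atomic issues, state fields, and function names (`ATOMIC_ISSUES`, `make_task`, `generate_tasks`) are all hypothetical, chosen only to show how tasks might be composed from atomic components and checked programmatically.

```python
from itertools import combinations

# Hypothetical atomic issues (illustrative only, not from the benchmark).
# Each entry: state key, broken value, fixed value.
ATOMIC_ISSUES = {
    "airplane_mode_on": ("airplane_mode", True, False),
    "mobile_data_off": ("mobile_data", False, True),
    "sim_unseated": ("sim_seated", False, True),
}

# Assumed healthy baseline for the toy world state.
HEALTHY_STATE = {"airplane_mode": False, "mobile_data": True, "sim_seated": True}


def make_task(issue_names):
    """Compose atomic issues into one task with an initial state and a goal check."""
    initial = dict(HEALTHY_STATE)
    goal = {}
    for name in issue_names:
        key, broken, fixed = ATOMIC_ISSUES[name]
        initial[key] = broken
        goal[key] = fixed

    def verify(state):
        # Solved when every key that was broken holds its fixed value.
        return all(state[k] == v for k, v in goal.items())

    return {"issues": tuple(issue_names), "initial_state": initial, "verify": verify}


def generate_tasks(max_issues):
    """Enumerate every combination of up to max_issues atomic issues."""
    names = sorted(ATOMIC_ISSUES)
    return [
        make_task(combo)
        for size in range(1, max_issues + 1)
        for combo in combinations(names, size)
    ]


tasks = generate_tasks(max_issues=2)
```

In this sketch, `max_issues` plays the role of a complexity control, and enumerating all combinations is one simple way to cover the space of issues; the actual benchmark may do both differently.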

Findings

01

Agent performance drops significantly when shifting from no-user to dual-control scenarios.

02

The benchmark effectively tests agent coordination and communication in shared environments.

03

The environment highlights challenges in guiding users and reasoning under dual-control conditions.

Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. To address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose…
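The dual-control setting described in the abstract can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration (the `SharedWorld` fields, the tool names, and the `act` helper); it is not the benchmark's API. The point it tries to show is that agent and user each hold different tools over one shared state, so neither can resolve the problem alone.

```python
class SharedWorld:
    """Toy shared state; fields are illustrative, not taken from the benchmark."""

    def __init__(self):
        self.line_suspended = True  # only the agent's tool changes this
        self.airplane_mode = True   # only the user's tool changes this

    def service_works(self):
        return not self.line_suspended and not self.airplane_mode


def agent_resume_line(world):
    # Hypothetical agent-side tool acting on the carrier account.
    world.line_suspended = False
    return "line resumed"


def user_toggle_airplane_mode(world):
    # Hypothetical user-side tool acting on the user's device.
    world.airplane_mode = not world.airplane_mode
    return f"airplane mode is now {'on' if world.airplane_mode else 'off'}"


AGENT_TOOLS = {"resume_line": agent_resume_line}
USER_TOOLS = {"toggle_airplane_mode": user_toggle_airplane_mode}


def act(world, role, tool_name):
    """Dispatch a tool call, enforcing that each role only uses its own tools."""
    tools = AGENT_TOOLS if role == "agent" else USER_TOOLS
    if tool_name not in tools:
        raise PermissionError(f"{role} cannot call {tool_name}")
    return tools[tool_name](world)


world = SharedWorld()
act(world, "agent", "resume_line")
agent_only_result = world.service_works()  # still broken: the device is in airplane mode
act(world, "user", "toggle_airplane_mode")
both_result = world.service_works()        # fixed only after the user also acts
```

In this toy setup the agent must communicate the device-side step to the user, since calling `toggle_airplane_mode` as the agent raises `PermissionError`.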

Code & Models

Datasets

Jarrodbarnes/tau2-sft-v4-dataset
dataset · 147 downloads

Taxonomy

Topics: Speech and dialogue systems · Social Robot Interaction and HRI · AI in Service Interactions