$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Quan Shi; Alexandra Zytek; Pedram Razavi; Karthik Narasimhan; Victor Barres

arXiv:2603.04370·cs.AI·March 5, 2026

TL;DR

This paper introduces $\tau$-Knowledge, a benchmark for evaluating conversational agents' ability to retrieve and apply unstructured domain knowledge in realistic, long-horizon interactions, and highlights the limitations of current agents.

Contribution

It presents a new evaluation framework and domain, $\tau$-Banking, for assessing agents' performance in complex knowledge retrieval and policy execution tasks.

Findings

01

Agents achieve a success rate of only ~25.5% on the benchmark.

02

Retrieval accuracy degrades sharply over repeated trials.

03

Agents struggle with complex, densely linked knowledge bases.

Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $\tau$-Knowledge, an extension of $\tau$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $\tau$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · AI in Service Interactions