$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres

TL;DR
This paper introduces $\tau$-Knowledge, a comprehensive benchmark for evaluating conversational agents' ability to retrieve and apply unstructured domain knowledge in realistic, long-horizon interactions, highlighting current limitations.
Contribution
It presents a new evaluation framework and domain, $\tau$-Banking, for assessing agents' performance in complex knowledge retrieval and policy execution tasks.
Findings
Agents achieve a success rate of only ~25.5% on the benchmark.
Retrieval accuracy degrades sharply over repeated trials.
Agents struggle with complex, densely linked knowledge bases.
Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $\tau$-Knowledge, an extension of $\tau$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $\tau$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · AI in Service Interactions
