ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Yifei Zhang; Hooshang Nayyeri; Rinat Khaziev; Emine Yilmaz; Gokhan Tur; Dilek Hakkani-Tür; Hari Thadakamalla

arXiv:2601.11854 · cs.CL · February 2, 2026

TL;DR

This paper introduces ATOD, a comprehensive benchmark and evaluation framework for assessing advanced agentic task-oriented dialogue systems with capabilities like multi-goal coordination, memory, and proactivity.

Contribution

We present ATOD, a novel benchmark with synthetic dialogues, and ATOD-Eval, a detailed evaluation framework for measuring agentic behaviors in dialogue systems.

Findings

1. ATOD enables systematic evaluation of agentic TOD capabilities.
2. The proposed evaluator outperforms existing methods in accuracy and efficiency.
3. Experiments demonstrate the effectiveness of ATOD-Eval in comprehensive assessment.

Abstract

Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports…

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications