TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Sarik Ghazarian; Abhinav Gullapalli; Swair Shah; Anurag Beniwal; Nanyun Peng; Narayanan Sadagopan; Zhou Yu

arXiv:2511.15976·cs.CL·November 21, 2025

TL;DR

This paper introduces TOD-ProcBench, a comprehensive benchmark for evaluating large language models' ability to understand and follow complex, multi-level instructions in task-oriented dialogues, addressing limitations of previous simplified benchmarks.

Contribution

We propose TOD-ProcBench, a challenging, multi-task benchmark with complex instructions and constraints, designed to systematically assess LLMs' instruction-following in multi-turn dialogues.

Findings

01. LLMs show varied performance across tasks
02. Multilingual settings impact instruction compliance
03. Instruction text format influences model understanding

Abstract

In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems