ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling
Dechuan Teng, Chunlin Lu, Libo Qin, Wanxiang Che

TL;DR
ESAinsTOD is a unified framework that enhances task-oriented dialog systems by making them schema-aware and instruction-following, enabling better generalization, robustness, and adaptability across diverse datasets and low-resource scenarios.
Contribution
The paper introduces a structured, end-to-end, schema-aware instruction-tuning framework that improves the adaptability and performance of task-oriented dialog models compared with traditional fine-tuning methods.
Findings
Outperforms state-of-the-art methods on multiple dialog benchmarks
Shows strong zero-shot generalization in low-resource settings
Enhances robustness against data noise and cascading errors
Abstract
Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make…
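The two alignment mechanisms in the abstract can be made more concrete with a small example. The code below is a minimal sketch, assuming a plain-text prompt format of our own. Each training sample pairs a task-flow instruction (the instruction side) with a serialized dataset schema (one plausible way to expose the schema to the model) and the dialog context. The names `Schema`, `TASK_INSTRUCTIONS`, `render_schema`, and `build_sample`, the `###` section markers, and the hotel slots are all hypothetical. None of them comes from the paper, and the summary does not give its actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Hypothetical dataset schema: one domain plus its slot descriptions."""
    domain: str
    slots: dict[str, str] = field(default_factory=dict)  # slot name -> description


# Hypothetical instruction templates, one per task flow in a modular TOD pipeline.
TASK_INSTRUCTIONS = {
    "dst": "Track the dialog state. Output slot=value pairs using only slots in the schema.",
    "policy": "Given the dialog state, predict the next system dialog acts.",
    "nlg": "Generate the next system response from the predicted dialog acts.",
}


def render_schema(schema: Schema) -> str:
    """Serialize the schema as text so the model can condition on it."""
    lines = [f"[domain] {schema.domain}"]
    for name, desc in schema.slots.items():
        lines.append(f"[slot] {name}: {desc}")
    return "\n".join(lines)


def build_sample(task: str, schema: Schema, history: list[str], target: str) -> dict[str, str]:
    """Pack instruction, schema, and dialog context into one prompt/response pair."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task flow: {task}")
    prompt = "\n".join([
        f"### Instruction\n{TASK_INSTRUCTIONS[task]}",
        f"### Schema\n{render_schema(schema)}",
        "### Dialog\n" + "\n".join(history),
        "### Output",
    ])
    return {"prompt": prompt, "response": target}


# Usage: a dialog state tracking sample for a hypothetical hotel domain.
hotel = Schema("hotel", {"area": "part of town", "stars": "star rating"})
sample = build_sample("dst", hotel, ["user: i need a 4-star hotel in the north"], "area=north; stars=4")
print(sample["prompt"])
```

Because the instruction and schema are ordinary text in the prompt, the same model could in principle be trained on heterogeneous datasets. Each dataset would supply its own task flows and slot inventory, and no dataset-specific architecture would be needed.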
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
