Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator
Qinyuan Cheng, Linyang Li, Guofeng Quan, Feng Gao, Xiaofeng Mou, Xipeng Qiu

TL;DR
This paper introduces an interactive evaluation framework for task-oriented dialogue systems using a pre-trained user simulator, addressing policy mismatch issues and providing new metrics for response quality assessment.
Contribution
It presents a goal-oriented user simulator and an interactive evaluation method that better reflect real-world interactions for TOD systems.
Findings
RL-based TOD systems achieve nearly 98% inform and success rates in the interactive evaluation on MultiWOZ
The proposed scores effectively measure response quality
Interactive evaluation reveals insights beyond traditional metrics
Abstract
Task-Oriented Dialogue (TOD) systems are drawing more and more attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies while the evaluation of TOD is limited by a policy mismatch problem. That is, during evaluation, the user utterances are from the annotated dataset while these utterances should interact with previous responses which can have many alternatives besides annotated texts. Therefore, in this work, we propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues. Besides, we introduce a sentence-level and a session-level score to measure the sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems…
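The interactive setup described in the abstract, where a user simulator and a dialogue system take turns so that each user utterance responds to what the system actually said, can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: `run_interactive_session`, `toy_user`, `toy_system`, the `[END]` marker, and the goal format are all hypothetical names and conventions chosen for the example, and the toy callables stand in for the pre-trained models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# a user simulator maps (goal, history) to an utterance,
# a dialogue system maps history to a response.
UserFn = Callable[[Dict[str, str], List[str]], str]
SystemFn = Callable[[List[str]], str]

END_TOKEN = "[END]"  # assumed marker the simulator emits when its goal is covered


def run_interactive_session(
    user: UserFn,
    system: SystemFn,
    goal: Dict[str, str],
    max_turns: int = 10,
) -> List[Tuple[str, str]]:
    """Alternate user and system turns; return (user, system) pairs."""
    history: List[str] = []
    turns: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        # The user utterance is conditioned on the system's own earlier
        # responses, not on annotated dataset responses.
        user_utt = user(goal, history)
        if user_utt == END_TOKEN:
            break
        history.append(user_utt)
        sys_resp = system(history)
        history.append(sys_resp)
        turns.append((user_utt, sys_resp))
    return turns


def toy_user(goal: Dict[str, str], history: List[str]) -> str:
    """Toy stand-in: state the first goal slot not yet mentioned by the user."""
    said_so_far = " ".join(history[0::2])  # user turns sit at even indices
    for slot, value in goal.items():
        if value not in said_so_far:
            return f"I need {slot} = {value}"
    return END_TOKEN


def toy_system(history: List[str]) -> str:
    """Toy stand-in: acknowledge the latest user utterance."""
    return f"Noted: {history[-1]}"


dialogue = run_interactive_session(
    toy_user, toy_system, {"area": "centre", "food": "italian"}
)
for user_utt, sys_resp in dialogue:
    print("USER:", user_utt)
    print("SYS: ", sys_resp)
```

In a real evaluation, the generated dialogues would then be scored, for example with inform and success rates and the sentence-level and session-level scores the abstract mentions; that scoring step is not shown here.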
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · AI in Service Interactions
