TL;DR
This paper critically examines the MultiWOZ dataset's evaluation metrics, identifies inconsistencies, and provides standardized tools and recommendations to improve benchmarking of dialogue systems.
Contribution
It identifies inconsistencies in how data is preprocessed and how BLEU, Inform, and Success are reported on MultiWOZ, re-evaluates existing models with standardized scripts, and offers recommendations for fair corpus-based benchmarking in future research.
Findings
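Data preprocessing and the reporting of BLEU, Inform, and Success are inconsistent across work on MultiWOZ.
Re-evaluating 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups shows that their reported scores are not directly comparable.
Stand-alone standardized evaluation scripts are released for future benchmarking.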
Abstract
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future work.
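To illustrate why preprocessing differences can make reported BLEU scores incomparable, here is a minimal sketch. It is not the authors' released evaluation script. The `corpus_bleu` function is a simplified, hypothetical implementation (one reference per hypothesis, n-grams up to order 2, no smoothing), and the utterances are invented for illustration rather than taken from MultiWOZ. It scores the same system output twice, once with raw whitespace tokens and once after lowercasing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs, max_n=2):
    """Simplified corpus-level BLEU (assumption: one reference per
    hypothesis, uniform weights, no smoothing, orders 1..max_n)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped matches: a hypothesis n-gram counts at most as often
            # as it occurs in the reference.
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# Invented example utterances (not from MultiWOZ).
hypothesis = "The Golden House is in the centre ."
reference = "the golden house is in the centre ."

# Setup A: raw whitespace tokenization, case preserved.
raw_score = corpus_bleu([hypothesis.split()], [reference.split()])

# Setup B: identical text, lowercased before tokenization.
lower_score = corpus_bleu([hypothesis.lower().split()],
                          [reference.lower().split()])

print(f"raw: {raw_score:.3f}  lowercased: {lower_score:.3f}")
```

The system output is identical in both setups, yet the lowercased variant scores higher because more n-grams match the reference. Two papers that differ only in such a normalization step would therefore report different BLEU for the same model.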
