Shades of BLEU, Flavours of Success: The Case of MultiWOZ

Tomáš Nekvinda; Ondřej Dušek

arXiv:2106.05555·cs.CL·June 11, 2021

TL;DR

This paper examines how BLEU, Inform, and Success are computed and reported on the MultiWOZ dataset, identifies inconsistencies in preprocessing and metric reporting, and releases standardized evaluation scripts and recommendations for benchmarking task-oriented dialogue systems.

Contribution

It highlights problems in current evaluation practice on MultiWOZ, re-evaluates 7 end-to-end and 6 policy optimization models with standardized scripts in as-fair-as-possible setups, and offers basic recommendations for corpus-based benchmarking in future work.

Findings

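01

Data preprocessing and the reporting of BLEU, Inform, and Success rates are inconsistent across systems evaluated on MultiWOZ.

02

Re-evaluation of 13 models shows that their reported scores cannot be directly compared.

03

The authors release stand-alone standardized evaluation scripts to make future systems comparable.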

Abstract

The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark, such as unsatisfactory preprocessing, insufficient or under-specified evaluation metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy optimization models in as-fair-as-possible setups, and we show that their reported scores cannot be directly compared. To facilitate comparison of future systems, we release our stand-alone standardized evaluation scripts. We also give basic recommendations for corpus-based benchmarking in future work.
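
To make the preprocessing issue concrete, the following is a minimal sketch of why BLEU scores computed under different text normalization are not comparable. It is an illustration only, not the released evaluation scripts: the functions `ngrams`, `corpus_bleu`, `tokenize_raw`, and `tokenize_normalized`, and the example sentences, are all hypothetical, and the BLEU variant shown (single reference, no smoothing) is an assumption. Inform and Success rates are not covered here.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum((h & r).values())  # clipped n-gram matches
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions, in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: 1 if the hypothesis side is longer than the reference side.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


def tokenize_raw(text):
    """Hypothetical preprocessing A: whitespace split, case preserved."""
    return text.split()


def tokenize_normalized(text):
    """Hypothetical preprocessing B: lowercase and split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Made-up example: the same system response scored against the same reference.
reference = "There are 5 hotels in the centre. Do you have a price range?"
response = "there are 5 hotels in the centre . do you have a price range ?"

raw_score = corpus_bleu([tokenize_raw(response)], [tokenize_raw(reference)])
norm_score = corpus_bleu(
    [tokenize_normalized(response)], [tokenize_normalized(reference)]
)
print(f"BLEU, raw tokens:        {raw_score:.3f}")
print(f"BLEU, normalized tokens: {norm_score:.3f}")
```

Under the normalized tokenization the response matches the reference exactly, while the raw whitespace split penalizes casing and attached punctuation, so the same output receives two different scores. This is the kind of discrepancy that a shared, stand-alone evaluation script is meant to remove.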

Code & Models

Repositories

Tomiinek/MultiWOZ_Evaluation
Official
