How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

Huda Khayrallah; Zuhaib Akhtar; Edward Cohen; Jyothir S V; João Sedoc

arXiv:2305.14533 · cs.CL · November 20, 2024 · 1 cites

Open Access

TL;DR

This paper introduces MMSMR, a large multi-system multi-reference dataset for dialog evaluation, aiming to improve the robustness of automatic metrics in reflecting human judgments.

Contribution

The paper presents MMSMR, a novel multi-reference dialog dataset, and evaluates 1750 systems to analyze metric robustness and dataset requirements.

Findings

1. MMSMR improves robustness in dialog metric evaluation.
2. Evaluation of 1750 systems demonstrates the dataset's effectiveness.
3. The comprehensive dataset and system outputs are released for future research.

Abstract

We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog. Automatic metrics for dialog evaluation should be robust proxies for human judgments; however, the verification of robustness is currently far from satisfactory. To quantify the robustness correlation and understand what is necessary in a test set, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets and introduce this new language learning conversation dataset. We then train 1750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, along with model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Test