# Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

**Authors:** Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, and Jeffrey P. Bigham

arXiv: 1907.10568 · 2019-09-10

## TL;DR

This paper explores multi-reference evaluation as a way to improve automatic assessment of open-domain dialogue systems. Augmenting the DailyDialog test set with multiple references improves the correlation between several automatic metrics and human judgment.

## Contribution

The paper augments the DailyDialog test set with multiple human-generated references and shows that evaluating against them brings several automatic metrics into closer agreement with human judgments.

## Key findings

- Multi-reference evaluation improves the correlation between several automatic metrics and human judgment.
- The improvement is demonstrated on the DailyDialog test set, augmented with multiple references.
- The improvement holds for both the quality and the diversity of system output.

## Abstract

The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments show that the use of multiple references results in improved correlation between several automatic metrics and human judgement for both the quality and the diversity of system output.
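To make the idea of multi-reference evaluation concrete, here is a minimal sketch of how a sentence-level metric can be scored against several references instead of one. It rests on assumptions not taken from the paper: `unigram_f1` is a simple stand-in metric, `multi_reference_score` is a hypothetical helper name, and taking the maximum over references is only one common way to aggregate. The metrics and aggregation the authors actually use are described in the full paper.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Stand-in metric (an assumption): F1 over whitespace-separated unigrams."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis, references, metric=unigram_f1):
    """Score a response against every reference and keep the best match.

    Max aggregation is an assumption made for this sketch.
    """
    if not references:
        raise ValueError("at least one reference is required")
    return max(metric(hypothesis, ref) for ref in references)


response = "i love hiking on weekends"
single = ["i usually stay home and read"]
multiple = single + ["i love hiking", "mostly i go to the gym"]

single_score = multi_reference_score(response, single)
multi_score = multi_reference_score(response, multiple)
print(f"single-reference: {single_score:.3f}")
print(f"multi-reference:  {multi_score:.3f}")
```

In this toy case a valid response that differs from the lone reference scores low, while the same response scores higher once an additional acceptable reply is available. That is the intuition behind augmenting a test set with multiple references.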

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.10568/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1907.10568/full.md

## References

42 references — full list in the complete paper: https://tomesphere.com/paper/1907.10568/full.md

---
Source: https://tomesphere.com/paper/1907.10568