TL;DR
This paper presents FED, an unsupervised, interpretable automatic evaluation metric for open-domain dialog that uses DialoGPT without fine-tuning and attains moderate to strong correlation with human judgments.
Contribution
It introduces two things: the FED metric, an unsupervised dialog evaluation metric that assesses fine-grained dialog qualities without relying on ground-truth responses or training data, and the FED dataset, a set of conversations annotated with those qualities.
Findings
FED correlates moderately to strongly with human judgments.
FED evaluates fine-grained dialog qualities at both the turn and whole-dialog levels.
The FED dataset consists of human-system and human-human conversations annotated with eighteen fine-grained dialog qualities.
Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
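The correlation claim above can be made concrete with a small sketch of how an automatic metric is typically compared against human annotations. This is a minimal illustration, not the authors' evaluation code: it assumes Spearman rank correlation (the abstract does not name the coefficient), and the function names and the sample scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


if __name__ == "__main__":
    # Hypothetical data: one automatic score and one human rating per dialog.
    metric_scores = [0.12, 0.55, 0.31, 0.90, 0.47]
    human_ratings = [2, 4, 2, 5, 3]
    print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.3f}")
```

In this setup, one correlation would be computed per dialog quality and per level (turn or whole dialog), with each list holding one entry per annotated turn or conversation.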
