PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison
ChaeHun Park, Minseok Choi, Dohyun Lee, and Jaegul Choo

TL;DR
PairEval is an open-domain dialogue evaluation metric that scores a response by comparing it against responses from other conversations, which aligns better with human judgments and helps detect common dialogue system failures.
Contribution
The paper introduces a pairwise comparison approach to dialogue response evaluation that correlates more closely with human judgments and is more robust than existing metrics.
Findings
Higher correlation with human judgments than baseline metrics
More robust in detecting repetition and speaker insensitivity
Effective across multiple benchmark datasets
Abstract
Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PairEval, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. PairEval is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find…
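The pairwise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-of-preferences aggregation, and the toy comparator are all assumptions made for illustration. In PairEval the comparison is performed by a language model specialized in pairwise comparison; here a hypothetical lexical-diversity heuristic stands in for it so the example runs on its own.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# An example is a (dialogue history, response) pair.
Example = Tuple[str, str]
# A comparator returns a preference in [0, 1]: 1.0 means the first
# example's response is judged better, 0.0 means the second is better.
Comparator = Callable[[Example, Example], float]


def pairwise_score(target: Example,
                   comparisons: Sequence[Example],
                   prefer: Comparator) -> float:
    """Score a target response by comparing it with responses drawn from
    other conversations. Averaging the preferences is an assumption of
    this sketch, not a detail taken from the paper."""
    if not comparisons:
        raise ValueError("at least one comparison example is required")
    return mean(prefer(target, other) for other in comparisons)


def _diversity(response: str) -> float:
    """Fraction of distinct whitespace-separated tokens in a response."""
    words = response.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def toy_prefer(a: Example, b: Example) -> float:
    """Hypothetical stand-in for the learned comparator: prefers the
    response with higher lexical diversity, so repetition is penalized.
    It ignores the dialogue history, which a real comparator would use."""
    da, db = _diversity(a[1]), _diversity(b[1])
    if da == db:
        return 0.5
    return 1.0 if da > db else 0.0


# Comparison examples taken from different (made-up) conversations.
comparisons = [
    ("Any plans tonight?", "I might watch a movie."),
    ("Do you like tea?", "yes yes I do"),
]
good = ("How was your trip?", "It was great, we hiked every day.")
repetitive = ("How was your trip?", "good good good good")

good_score = pairwise_score(good, comparisons, toy_prefer)
repetitive_score = pairwise_score(repetitive, comparisons, toy_prefer)
```

Under this toy comparator the repetitive response receives a lower score than the fluent one. Swapping `toy_prefer` for a model-based comparator would leave `pairwise_score` unchanged.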
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling
