Investigating the Impact of Pre-trained Language Models on Dialog Evaluation
Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Thomas Friedrichs, Haizhou Li

TL;DR
This paper systematically evaluates how various pre-trained language models influence the effectiveness of automatic metrics in open-domain dialog evaluation across multiple benchmarks.
Contribution
It provides the first comprehensive analysis of the impact of different Pr-LMs on dialog evaluation metrics, considering factors like pre-training objectives and model size.
Findings
Pr-LM choice significantly affects metric performance
Model size and pre-training objectives influence evaluation robustness
Cross-dataset performance varies with different Pr-LMs
Abstract
Recently, there has been a surge of interest in applying pre-trained language models (Pr-LM) in automatic open-domain dialog evaluation. Pr-LMs offer a promising direction for addressing the multi-domain evaluation challenge. Yet, the impact of different Pr-LMs on the performance of automatic metrics is not well understood. This paper examines 8 different Pr-LMs and studies their impact on three typical automatic dialog evaluation metrics across three different dialog evaluation benchmarks. Specifically, we analyze how the choice of Pr-LMs affects the performance of automatic metrics. Extensive correlation analyses on each of the metrics are performed to assess the effects of different Pr-LMs along various axes, including pre-training objectives, dialog evaluation criteria, model size, and cross-dataset robustness. This study serves as the first comprehensive assessment of the effects of…
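The correlation analyses mentioned in the abstract can be illustrated with a small sketch. The excerpt does not say which correlation coefficient the authors use, so Spearman rank correlation is an assumption here, and the model names (`plm_a`, `plm_b`), the scores, and the human ratings are all made-up placeholders rather than data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings for five responses (not from the paper).
human_ratings = [1, 2, 3, 4, 5]

# Hypothetical scores from one metric built on two different Pr-LMs.
metric_scores = {
    "plm_a": [0.1, 0.4, 0.35, 0.8, 0.9],
    "plm_b": [0.5, 0.2, 0.6, 0.3, 0.4],
}

for name, scores in metric_scores.items():
    print(name, round(spearman(scores, human_ratings), 3))
```

In this toy setup, a metric whose scores rank responses more like the human ratings gets a higher coefficient, which is the kind of comparison that lets one Pr-LM backbone be judged against another for the same metric.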
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
