Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Sarah E. Finch, James D. Finch, Jinho D. Choi

TL;DR
This study investigates how different human evaluator groups influence the assessment of chat-oriented dialogue systems, finding that evaluation robustness varies across metrics and evaluator expertise levels.
Contribution
It provides a comprehensive analysis of evaluator group effects on dialogue system evaluation, highlighting the importance of evaluator expertise and objectivity.
Findings
Likert evaluations are robust across evaluator groups, with only minor differences observed when the group changes.
Pairwise evaluations do not show the same robustness to evaluator group changes.
Evaluator expertise impacts objectivity and assessment consistency.
Abstract
Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, there is a significant variation in previous work regarding who gets recruited as evaluators. Evaluator groups such as domain experts, university students, and professional annotators have been used to assess and compare dialogue systems, although it is unclear to what extent the choice of an evaluator group can affect results. This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups. Our analysis reveals a robustness towards evaluator groups for Likert evaluations that is not seen for Pairwise, with only minor differences observed when changing evaluator groups. Furthermore, two notable limitations to this robustness are observed, which reveal discrepancies between…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · AI in Service Interactions
