Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

Sarah E. Finch; James D. Finch; Jinho D. Choi

arXiv:2309.07998 · cs.CL · September 18, 2023

TL;DR

This study investigates how the choice of human evaluator group influences the assessment of chat-oriented dialogue systems, finding that robustness to that choice depends on the evaluation method (Likert vs. Pairwise) and on evaluator expertise.

Contribution

It provides a comprehensive analysis of evaluator group effects on dialogue system evaluation, highlighting the importance of evaluator expertise and objectivity.

Findings

01 Likert evaluations are robust across evaluator groups, with only minor differences when the group changes.

02 Pairwise evaluations do not show the same robustness to the choice of evaluator group.

03 Evaluator expertise impacts objectivity and assessment consistency.

Abstract

Human evaluation has been widely accepted as the standard for evaluating chat-oriented dialogue systems. However, there is a significant variation in previous work regarding who gets recruited as evaluators. Evaluator groups such as domain experts, university students, and professional annotators have been used to assess and compare dialogue systems, although it is unclear to what extent the choice of an evaluator group can affect results. This paper analyzes the evaluator group impact on dialogue system evaluation by testing 4 state-of-the-art dialogue systems using 4 distinct evaluator groups. Our analysis reveals a robustness towards evaluator groups for Likert evaluations that is not seen for Pairwise, with only minor differences observed when changing evaluator groups. Furthermore, two notable limitations to this robustness are observed, which reveal discrepancies between…
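One way to make the notion of "robustness towards evaluator groups" concrete is to check whether different groups rank the same systems in the same order. The sketch below is a minimal illustration, not the paper's analysis procedure: the group names, system names, and ratings are hypothetical toy values, and Kendall's tau over mean Likert scores is an assumed choice of agreement measure.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy Likert ratings (1-5). These are NOT data from the paper.
# Layout: ratings[evaluator_group][system] -> list of scores.
ratings = {
    "group_a": {"sys1": [4, 5, 4], "sys2": [3, 3, 4], "sys3": [2, 3, 2], "sys4": [1, 2, 2]},
    "group_b": {"sys1": [5, 4, 4], "sys2": [3, 4, 3], "sys3": [3, 2, 2], "sys4": [2, 1, 1]},
    "group_c": {"sys1": [3, 3, 4], "sys2": [4, 4, 5], "sys3": [2, 2, 3], "sys4": [1, 1, 2]},
}


def mean_scores(group_ratings):
    """Average the Likert scores each system received from one group."""
    return {system: mean(scores) for system, scores in group_ratings.items()}


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a over the systems present in both score dicts.

    Each pair of systems is concordant if both groups order it the same
    way, discordant if they order it oppositely; tied pairs count as neither.
    """
    systems = sorted(set(scores_x) & set(scores_y))
    n_pairs = len(systems) * (len(systems) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        product = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


def group_agreement(all_ratings):
    """Kendall tau between the system rankings of every pair of groups."""
    means = {group: mean_scores(r) for group, r in all_ratings.items()}
    return {
        (g1, g2): kendall_tau(means[g1], means[g2])
        for g1, g2 in combinations(sorted(means), 2)
    }


agreement = group_agreement(ratings)
for (g1, g2), tau in agreement.items():
    print(f"{g1} vs {g2}: tau = {tau:.2f}")
```

A tau near 1 for every pair of groups would indicate that the system ranking survives a change of evaluator group; lower values flag pairs of groups that disagree on the ordering. A similar comparison could be run on Pairwise results by replacing the mean Likert score with each system's win rate.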

Code & Models

Repositories

sfillwo/dialogueeval-annotatorimpact (official)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · AI in Service Interactions