Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

Clemencia Siro; Mohammad Aliannejadi; Maarten de Rijke

arXiv:2404.12994·cs.IR·May 1, 2024

TL;DR

This paper investigates how user feedback, whether explicit or implicit, affects the evaluation of task-oriented dialogue systems by crowdworkers and LLMs. It finds that ratings differ significantly depending on whether that feedback is shown, and discusses the implications for future assessment methods.

Contribution

It compares two setups for evaluating dialogue systems, one that shows annotators the user's follow-up utterance and one that omits it, and highlights the impact of user feedback on assessment consistency and personalization.

Findings

01 User feedback influences evaluation ratings significantly.

02 Crowdworkers are more affected by user feedback on usefulness and interestingness.

03 User feedback improves agreement among crowdworkers on complex requests.
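
To make the agreement finding concrete, here is a minimal sketch of how agreement between two annotators could be compared across the two setups. This page does not say which agreement statistic the authors report, so Cohen's kappa is an assumption here, and the labels (on a hypothetical 1-3 scale) are made-up toy values, not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): the same six turns rated by two workers,
# once without and once with the user's follow-up utterance visible.
worker_1 = [1, 2, 2, 3, 1, 3]
worker_2_without_feedback = [1, 3, 2, 2, 1, 1]
worker_2_with_feedback = [1, 2, 2, 3, 1, 1]

kappa_without = cohen_kappa(worker_1, worker_2_without_feedback)
kappa_with = cohen_kappa(worker_1, worker_2_with_feedback)
print(f"kappa without feedback: {kappa_without:.2f}")
print(f"kappa with feedback:    {kappa_with:.2f}")
```

A higher kappa in the second condition would correspond to the pattern described in finding 03. With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the more natural choice.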

Abstract

In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation…

Code & Models

Repositories

clemenciah/llmcrowddialogueeval (official)
