TL;DR
This paper studies how user feedback, whether explicit or implicit, changes the ratings that crowdworkers and LLMs give to task-oriented dialogue system responses. The two annotator groups respond to that feedback differently, which has implications for how such systems should be assessed in the future.
Contribution
It introduces a comparative methodology that evaluates dialogue system responses with and without the user's follow-up utterance, and shows how that feedback affects assessment consistency and personalization.
Findings
User feedback influences evaluation ratings significantly.
Crowdworkers are more affected than LLMs by user feedback on usefulness and interestingness.
User feedback improves agreement among crowdworkers on complex requests.
Abstract
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation…
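To make the two-setup comparison concrete, the sketch below computes inter-annotator agreement for the same turns rated once without and once with the user's follow-up utterance. It is only an illustration: the choice of mean pairwise Cohen's kappa, the function names, the 1-3 rating scale, and the toy labels are all assumptions and are not taken from the paper, which may use a different agreement measure.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Fraction of turns on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all annotator pairs.

    `ratings` maps an annotator id to a list of labels, one per turn.
    """
    pairs = list(combinations(sorted(ratings), 2))
    return sum(cohen_kappa(ratings[p], ratings[q]) for p, q in pairs) / len(pairs)


# Hypothetical toy labels for one aspect (e.g. usefulness) on six turns.
# These are NOT data from the paper.
without_feedback = {
    "w1": [1, 2, 3, 2, 1, 3],
    "w2": [2, 2, 3, 1, 1, 2],
    "w3": [1, 3, 2, 2, 2, 3],
}
with_feedback = {
    "w1": [1, 2, 3, 2, 1, 3],
    "w2": [1, 2, 3, 2, 1, 2],
    "w3": [1, 2, 3, 2, 2, 3],
}

kappa_without = mean_pairwise_kappa(without_feedback)
kappa_with = mean_pairwise_kappa(with_feedback)
print(f"agreement without follow-up: {kappa_without:.3f}")
print(f"agreement with follow-up:    {kappa_with:.3f}")
```

In a real study the same computation would be repeated per aspect (relevance, usefulness, interestingness, explanation) and per annotator group, so that the effect of the follow-up utterance can be compared across crowdworkers and LLMs.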
