LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

Yi-Pei Chen; KuanChao Chu; Hideki Nakayama

arXiv:2406.02863·cs.CL·June 6, 2024

Open Access

TL;DR

This paper examines how prompt design, especially the order of the requested outputs, affects how large language models evaluate dialogues. Asking for reasons before scores yields more comprehensive evaluations.

Contribution

The paper shows that the order in which a prompt requests reasons and scores significantly affects LLM dialogue evaluation, and it offers guidance for prompt design.

Findings

01. Reason-first prompts improve evaluation consistency

02. Output order influences scoring accuracy

03. Prompt structure affects LLM evaluation quality

Abstract

This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
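To make the two output orders concrete, here is a minimal sketch of how a "reason-first" and a "score-first" prompt might be built. The function name `build_eval_prompt`, the prompt wording, the field names, and the 1 to 10 scale are all assumptions made for illustration; they are not the prompts used in the paper. One plausible reason the order matters is that an autoregressive model generates each token conditioned on the tokens before it, so a score written after the reason can depend on that reason, while a score written first cannot.

```python
from typing import Literal

OutputOrder = Literal["reason_first", "score_first"]


def build_eval_prompt(dialogue: str, order: OutputOrder) -> str:
    """Build a dialogue-evaluation prompt that requests a reason and a score
    in the given order. Wording and scale are hypothetical, not the paper's."""
    # Hypothetical descriptions of the two requested output fields.
    fields = {
        "reason": '"reason": a short explanation of the rating',
        "score": '"score": an integer from 1 to 10',
    }
    if order == "reason_first":
        sequence = ["reason", "score"]
    elif order == "score_first":
        sequence = ["score", "reason"]
    else:
        raise ValueError(f"unknown order: {order}")

    # Number the fields so the requested output order is explicit.
    output_spec = "\n".join(
        f"{i}. {fields[name]}" for i, name in enumerate(sequence, start=1)
    )
    return (
        "Rate the quality of the following dialogue.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Respond with these fields, in this order:\n"
        f"{output_spec}\n"
    )


if __name__ == "__main__":
    sample = "A: Hi, how was your trip?\nB: Great, thanks for asking!"
    print(build_eval_prompt(sample, "reason_first"))
    print(build_eval_prompt(sample, "score_first"))
```

Sending both variants to the same model for the same dialogue and comparing the returned scores is one simple way to check how sensitive a given model is to output order.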

Taxonomy

Topics: Speech and dialogue systems