LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
Yi-Pei Chen, KuanChao Chu, Hideki Nakayama

TL;DR
This paper examines how prompt design, especially the order in which outputs are requested, affects how large language models score dialogues. It finds that asking for reasons before the score yields more comprehensive evaluations.
Contribution
It demonstrates that output order in prompts significantly impacts LLM dialogue evaluation, providing guidance for better prompt design.
Findings
Reason-first prompts improve evaluation consistency
Output order influences scoring accuracy
Prompt structure affects LLM evaluation quality
Abstract
This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
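The two output orders compared in the abstract can be illustrated with a minimal sketch. The prompt wording, the 1-10 scale, and the names `OUTPUT_FORMATS`, `build_prompt`, and `parse_score` are assumptions made for illustration; they are not the authors' actual prompts or code.

```python
import re
from typing import Optional

# Hypothetical output-format instructions. The only difference between the
# two variants is the order in which the score and the reason are requested.
OUTPUT_FORMATS = {
    "score_first": "Score: <integer from 1 to 10>\nReason: <short explanation>",
    "reason_first": "Reason: <short explanation>\nScore: <integer from 1 to 10>",
}


def build_prompt(dialogue: str, order: str) -> str:
    """Build an evaluation prompt with the requested output order."""
    if order not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output order: {order!r}")
    return (
        "Evaluate the quality of the following dialogue.\n\n"
        f"{dialogue}\n\n"
        "Answer in exactly this format:\n"
        f"{OUTPUT_FORMATS[order]}"
    )


def parse_score(response: str) -> Optional[int]:
    """Extract the integer after 'Score:' from a model response, if present."""
    match = re.search(r"Score:\s*(\d+)", response)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    dialogue = "A: Any plans for the weekend?\nB: I might go hiking if it stays dry."
    for order in OUTPUT_FORMATS:
        print(f"--- {order} ---")
        print(build_prompt(dialogue, order))
```

Sending both prompts to the same model with the same dialogue and comparing the parsed scores is one way to observe the ordering effect the abstract describes. With an autoregressive model, a reason-first format means the score is generated after the explanation and can condition on it, which is a plausible intuition for why the order matters.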
Taxonomy
Topics: Speech and dialogue systems
