A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
KuanChao Chu, Yi-Pei Chen, Hideki Nakayama

TL;DR
This paper explores how prompt design, including output sequencing and optimization, affects the accuracy and consistency of large language models in evaluating generated texts, offering insights to improve LLM-based assessment methods.
Contribution
It systematically analyzes the impact of prompt output sequencing and optimization on LLM evaluation performance, providing guidelines for better prompt design.
Findings
Order of reasons and scores affects LLM scoring accuracy
Prompt optimization can improve scoring alignment with human judgments
Different prompt structures influence model understanding of evaluation rules
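The second finding depends on some measure of how closely LLM scores track human scores. The listing does not say which measure the authors use, so the following is only a minimal sketch under that assumption: it computes Pearson correlation and mean absolute error between two score lists. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(llm_scores, human_scores):
    """Pearson correlation between two equal-length score lists."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(llm_scores)
    mean_l = sum(llm_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((l - mean_l) * (h - mean_h) for l, h in zip(llm_scores, human_scores))
    var_l = sum((l - mean_l) ** 2 for l in llm_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_l == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_l * var_h)


def mean_abs_error(llm_scores, human_scores):
    """Average absolute gap between LLM and human scores."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(l - h) for l, h in zip(llm_scores, human_scores)) / len(llm_scores)


if __name__ == "__main__":
    # Toy values for illustration only; not data from the paper.
    human = [2, 5, 7, 9]
    llm = [3, 5, 6, 9]
    print(pearson(llm, human), mean_abs_error(llm, human))
```

Comparing these two numbers before and after a prompt change gives one rough way to check whether the change moved the LLM's scores closer to the human ones.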
Abstract
This research investigates prompt designs for evaluating generated texts using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and the subjectivity of evaluating generated text. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a different level of rule understanding in the prompt. An additional optimization may enhance scoring alignment if sufficient data is available. This insight is crucial for improving the accuracy and consistency of LLM-based evaluations.
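To make the idea of output sequencing concrete, here is a minimal sketch of two prompt variants that differ only in whether the model is asked to give the score before the reason or the reason before the score. The prompt wording, the JSON reply format, the 1 to 10 scale, and the names `build_prompt` and `parse_reply` are all assumptions for illustration; the abstract does not specify the authors' actual prompts.

```python
import json

# Hypothetical variant names mapped to the order of the requested output fields.
OUTPUT_ORDERS = {
    "score_first": ("score", "reason"),
    "reason_first": ("reason", "score"),
}


def build_prompt(text, order, low=1, high=10):
    """Build an evaluation prompt whose output fields follow the given order."""
    if order not in OUTPUT_ORDERS:
        raise ValueError(f"unknown order: {order!r}")
    descriptions = {
        "score": f"an integer from {low} to {high}",
        "reason": "a short explanation of the rating",
    }
    fields = ", ".join(
        f'"{key}": <{descriptions[key]}>' for key in OUTPUT_ORDERS[order]
    )
    return (
        "Rate the quality of the following generated text.\n"
        f"Reply with a JSON object whose keys appear in this order: {{{fields}}}\n\n"
        f"Text:\n{text}"
    )


def parse_reply(reply):
    """Extract (score, reason) from a JSON reply, whatever the key order."""
    data = json.loads(reply)
    return int(data["score"]), str(data["reason"])


if __name__ == "__main__":
    sample = "The cat sat on the mat and pondered the stock market."
    for order in OUTPUT_ORDERS:
        print(f"--- {order} ---")
        print(build_prompt(sample, order))
```

Sending both variants for the same text to the same model and comparing the parsed scores is one simple way to observe whether the output order changes the rating.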
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
