Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang, Hung-yi Lee

TL;DR
This paper investigates whether large language models can replace human evaluators in assessing text quality, showing that LLM-based evaluations align with expert human judgments and are more stable and reproducible than human evaluation.
Contribution
The study introduces LLM evaluation as a novel method for assessing text quality, showing its consistency with human evaluations across multiple NLP tasks.
Findings
LLM evaluation correlates well with human expert ratings
Results are stable across different task instruction formats
LLMs can reliably evaluate open-ended and adversarial texts
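As a rough illustration of what "correlates well" can mean in practice, the sketch below computes a rank correlation between two lists of ratings for the same texts. The choice of Kendall's tau (the simple tau-a variant, which makes no correction for ties) is an assumption made here for illustration, since this summary does not state which statistic was used. The function name and the ratings are hypothetical and are not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a between two equally long rating lists.

    Tied pairs count as neither concordant nor discordant, so heavy
    ties pull the value toward zero (tau-a applies no tie correction).
    """
    if len(x) != len(y):
        raise ValueError("rating lists must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two rated items")

    concordant = 0
    discordant = 0
    # Compare every pair of items: do both raters order them the same way?
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up ratings for five texts, for illustration only.
human_ratings = [4, 2, 5, 3, 1]
llm_ratings = [5, 2, 4, 3, 1]
tau = kendall_tau_a(human_ratings, llm_ratings)
```

A value near 1 means the two raters rank the texts in nearly the same order, a value near 0 means their rankings are unrelated, and a value near -1 means the rankings are reversed.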
Abstract
Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks:…
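The procedure the abstract describes (give the LLM the same instructions, sample, and question a human rater would see, then read off its answer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt layout, the 1 to 5 rating scale, the `llm` callable, and all function names are hypothetical, and no particular model API is implied.

```python
import re
from typing import Callable, List, Optional


def build_prompt(instructions: str, sample: str, question: str) -> str:
    """Assemble one evaluation prompt from the human-evaluation materials.

    The layout below is an assumed format, not one taken from the paper.
    """
    return f"{instructions}\n\nText:\n{sample}\n\n{question}\n"


def parse_rating(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the response that falls inside the scale.

    The 1-5 scale is an assumption; returns None if no valid rating is found.
    """
    for token in re.findall(r"\d+", response):
        value = int(token)
        if low <= value <= high:
            return value
    return None


def llm_evaluate(
    llm: Callable[[str], str],
    instructions: str,
    samples: List[str],
    question: str,
) -> List[Optional[int]]:
    """Rate each sample by querying `llm`, a hypothetical prompt-to-text function."""
    return [
        parse_rating(llm(build_prompt(instructions, sample, question)))
        for sample in samples
    ]
```

Passing the model in as a plain callable keeps the sketch independent of any specific provider, and a stub function can stand in for it when testing the prompt and parsing logic.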
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
