Can Large Language Models Be an Alternative to Human Evaluations?

Cheng-Han Chiang; Hung-yi Lee

arXiv:2305.01937 · cs.CL · May 4, 2023 · 31 cites

Open Access

TL;DR

This paper investigates whether large language models can replace human evaluators in assessing text quality. It finds that LLM-based ratings are consistent with expert human judgments and remain stable across different task instruction formats.

Contribution

The study introduces LLM evaluation, in which an LLM receives the same instructions, samples, and questions used in human evaluation and generates the answers, and shows that its results are consistent with human evaluation on two NLP tasks.

Findings

01. LLM evaluation correlates well with human expert ratings.

02. Results are stable across different task instruction formats.

03. LLMs can reliably evaluate open-ended and adversarial texts.
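
One way to quantify the first finding is a rank correlation between human and LLM ratings of the same texts. The sketch below computes Kendall's tau-b in plain Python; the choice of this statistic and the function name `kendall_tau_b` are assumptions made for illustration, not details taken from this page, and the ratings in the usage lines are made up.

```python
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long rating lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("rating lists must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # both raters order the pair the same way
            else:
                discordant += 1  # the raters order the pair oppositely
    untied_in_y = concordant + discordant + tied_x_only
    untied_in_x = concordant + discordant + tied_y_only
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0


# Made-up ratings for five texts, for illustration only.
human_ratings = [5, 4, 2, 3, 1]
llm_ratings = [4, 4, 2, 3, 2]
tau = kendall_tau_b(human_ratings, llm_ratings)
```

A value near 1 means the two raters rank the texts in nearly the same order; a value near 0 means their rankings are unrelated.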

Abstract

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks:…
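
The procedure the abstract describes (give the LLM the same instructions, sample, and question a human rater would see, then read off its answer) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `build_prompt`, `parse_rating`, and `llm_evaluate` are hypothetical, the `llm` argument stands in for any text-in, text-out model call, and the 1 to 5 rating scale is an assumption.

```python
import re
from typing import Callable, Optional


def build_prompt(instructions: str, sample: str, question: str) -> str:
    """Concatenate the same three pieces a human evaluator would be shown."""
    return f"{instructions.strip()}\n\n{sample.strip()}\n\n{question.strip()}"


def parse_rating(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the response that lies on the assumed scale."""
    for token in re.findall(r"\d+", response):
        value = int(token)
        if low <= value <= high:
            return value
    return None  # the model did not give a usable rating


def llm_evaluate(
    llm: Callable[[str], str],  # hypothetical: any function mapping prompt to text
    instructions: str,
    sample: str,
    question: str,
) -> Optional[int]:
    """Ask the model the human-evaluation question and extract its rating."""
    return parse_rating(llm(build_prompt(instructions, sample, question)))


# Usage with a stand-in model that always gives the same answer.
def fake_llm(prompt: str) -> str:
    return "I would rate it 4 out of 5."


rating = llm_evaluate(
    fake_llm,
    "Please rate the story fragment.",
    "Once upon a time...",
    "How grammatically correct is the text? (1 = lowest, 5 = highest)",
)
```

Keeping the prompt builder separate from the parser makes it easy to vary the instruction format while holding the rating extraction fixed, which is the kind of comparison the stability finding refers to.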

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)