Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Chenhui Shen; Liying Cheng; Xuan-Phi Nguyen; Yang You; Lidong Bing

arXiv:2305.13091 · cs.CL · October 23, 2023 · 5 cites

TL;DR

Large language models like ChatGPT and GPT-4 show promise in evaluating abstractive summarization but currently lack the consistency and reliability needed to replace human judgment, especially with high-quality summaries.

Contribution

This paper provides an extensive analysis of LLMs as automatic evaluators for abstractive summarization, highlighting their limitations and unreliability compared to human assessments.

Findings

01. LLM evaluators outperform traditional automatic metrics but rate candidate systems inconsistently.

02. LLM evaluators struggle to reliably compare summaries whose quality is close.

03. The correlation between LLM and human judgments drops as summary quality rises.
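
The findings above are stated in terms of correlation between an evaluator's scores and human judgments. As a minimal sketch of what such a comparison involves, the snippet below computes a Spearman rank correlation in plain Python. Spearman is a common choice for this kind of meta-evaluation, but this listing does not say which coefficient the authors use, so treat that as an assumption. The function names and the score lists are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical scores for five summaries (illustrative only, not real data).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
llm_scores = [2.0, 1.0, 3.0, 5.0, 4.0]

print(round(spearman(human_scores, llm_scores), 3))  # 0.8
```

A value near 1 means the evaluator orders the summaries much as the human raters do. When all candidates are of similar quality, small scoring disagreements reorder them easily, which is one way a correlation like this can fall.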

Abstract

With the recent, undeniable advances in the reasoning abilities of large language models (LLMs) such as ChatGPT and GPT-4, there is a growing trend of using LLMs for various tasks. One area where LLMs can be employed is as an alternative evaluation metric for complex generative tasks, which generally demand expensive human judges to complement the traditional automatic metrics across evaluation dimensions such as fluency and consistency. In this work, we conduct an extensive analysis to investigate the stability and reliability of LLMs as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready to replace human evaluators because of significant limitations. That is, LLM evaluators rate each candidate system inconsistently and are dimension-dependent. They also struggle to compare candidates with…

Code & Models

Repositories

damo-nlp-sg/llm_summeval
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding