Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing

TL;DR
Large language models like ChatGPT and GPT-4 show promise in evaluating abstractive summarization but currently lack the consistency and reliability needed to replace human judgment, especially with high-quality summaries.
Contribution
This paper provides an extensive analysis of LLMs as automatic evaluators for abstractive summarization, highlighting their limitations and unreliability compared to human assessments.
Findings
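ChatGPT and GPT-4 outperform commonly used automatic metrics, but they rate each candidate system inconsistently and their behavior depends on the evaluation dimension.
LLM evaluators struggle to reliably compare candidate summaries whose quality is close.
The correlation between LLM and human judgments drops as summary quality rises.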
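The findings above are stated in terms of correlation with human judgments. As a minimal sketch of what such a comparison could look like, the snippet below computes a Spearman rank correlation between human scores and an automatic evaluator's scores. The choice of Spearman is an assumption (this summary does not say which coefficient the authors report), the function names are hypothetical, and the scores are made-up illustrative values, not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five hypothetical summaries (illustration only).
human_scores = [4.5, 3.0, 2.0, 5.0, 3.5]
evaluator_scores = [4, 3, 3, 5, 2]

print(f"Spearman correlation: {spearman(human_scores, evaluator_scores):.3f}")
```

Under this framing, a value near 1 would mean the evaluator orders the summaries much as the human judges do. Restricting the computation to a subset of uniformly strong summaries is one way to probe the third finding.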
Abstract
With the recent undeniable advancement in reasoning abilities in large language models (LLMs) like ChatGPT and GPT-4, there is a growing trend for using LLMs on various tasks. One area where LLMs can be employed is as an alternative evaluation metric for complex generative tasks, which generally demand expensive human judges to complement the traditional automatic metrics for various evaluation dimensions such as fluency and consistency. In this work, we conduct extensive analysis to investigate the stability and reliability of LLMs as automatic evaluators for abstractive summarization. We found that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements due to significant limitations. That is, LLM evaluators rate each candidate system inconsistently and are dimension-dependent. They also struggle to compare candidates with…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding
