Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Andrea Sottana; Bin Liang; Kai Zou; Zheng Yuan

arXiv:2310.13800 · cs.CL · October 24, 2023 · 1 citation

TL;DR

This paper assesses the reliability of current automatic evaluation metrics for large language models, comparing them with human judgments across multiple NLP tasks and exploring GPT-4's potential as an evaluator.

Contribution

It provides a hybrid evaluation of LLMs on NLP benchmarks, highlighting the discrepancies between automatic metrics, human judgment, and GPT-4's evaluation capabilities.

Findings

01

ChatGPT outperforms many models according to human reviewers.

02

Automatic metrics correlate poorly with human judgments.

03

GPT-4 reasonably aligns with human evaluations across tasks.
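
Findings 02 and 03 are both claims about how closely one set of scores tracks another. As a minimal sketch of how such agreement can be quantified, the snippet below computes a Spearman rank correlation in plain Python. This summary does not say which statistic the paper uses, so the choice of Spearman is an assumption, and the function names and the toy scores are hypothetical, invented purely for illustration. They are not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over every value tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy data: one automatic-metric score and one human
    # rating per system output. Made up for illustration only.
    metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47]
    human_ratings = [3, 4, 2, 5, 3]
    print(f"Spearman rho: {spearman(metric_scores, human_ratings):.3f}")
```

A value near 1 would indicate that the metric ranks outputs much as the human reviewers do; a value near 0 or below would be consistent with the weak agreement described in finding 02.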

Abstract

Large Language Models (LLMs) evaluation is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation on a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much…

Code & Models

Repositories

protagolabs/seq2seq_llm_evaluation
none · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Residual Connection