LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

Yiqi Liu; Nafise Sadat Moosavi; Chenghua Lin

arXiv:2311.09766 · cs.CL · June 10, 2024 · 2 cites

TL;DR

This paper reveals that language model-based evaluation metrics tend to favor texts generated by the same underlying models, especially in reference-free settings, indicating a bias that challenges their reliability in NLP assessments.

Contribution

The study uncovers a latent bias in prominent LM-based evaluation metrics, emphasizing the need for more objective evaluation methods in NLP.

Findings

01 LM-based metrics favor their own generated texts

02 Bias is stronger in reference-free evaluations

03 Evaluation scores are influenced by model-specific factors

Abstract

Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (e.g. BARTScore, T5Score, and GPTScore) demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results…
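One way to make the question in the abstract concrete is to compare how highly a metric ranks summaries from its own underlying model against how highly metrics built on other models rank those same summaries. The sketch below is only an illustration under stated assumptions, not the authors' procedure: the function names, the rank-based measure, the `model_a`/`model_b`/`model_c` labels, and the toy scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def ranks(row):
    """Rank generators under one evaluator (1 = highest score)."""
    ordered = sorted(row, key=row.get, reverse=True)
    return {gen: i + 1 for i, gen in enumerate(ordered)}


def self_preference(scores):
    """Hypothetical self-preference measure (an assumption, not from the paper).

    scores[evaluator][generator] is the mean score that a metric built on
    `evaluator` assigns to summaries written by `generator`.
    Returns, per model, (mean rank given by other evaluators) minus
    (rank the model's own metric gives it). Positive values mean the
    model's own metric ranks it higher than the other metrics do.
    """
    per_eval = {ev: ranks(row) for ev, row in scores.items()}
    result = {}
    for model in scores:
        others = [
            per_eval[ev][model]
            for ev in per_eval
            if ev != model and model in per_eval[ev]
        ]
        if model not in per_eval[model] or not others:
            continue  # need both a self-rank and at least one outside rank
        result[model] = mean(others) - per_eval[model][model]
    return result


# Made-up numbers for illustration only (log-likelihood-style scores).
toy_scores = {
    "model_a": {"model_a": -1.0, "model_b": -1.4, "model_c": -1.6},
    "model_b": {"model_a": -2.1, "model_b": -1.9, "model_c": -2.5},
    "model_c": {"model_a": -0.8, "model_b": -0.9, "model_c": -0.7},
}

print(self_preference(toy_scores))
```

Ranks are used instead of raw scores because different metrics produce scores on different scales. A positive value is only suggestive: it would need to be checked against human judgments or reference-based scores before being read as bias.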

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification