LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
Yiqi Liu, Nafise Sadat Moosavi, Chenghua Lin

TL;DR
This paper reveals that language model-based evaluation metrics tend to favor texts generated by the same underlying models, especially in reference-free settings, indicating a bias that challenges their reliability in NLP assessments.
Contribution
The study uncovers a latent bias in prominent LM-based evaluation metrics, emphasizing the need for more objective evaluation methods in NLP.
Findings
LM-based metrics favor their own generated texts
Bias is stronger in reference-free evaluations
Evaluation scores are influenced by model-specific factors
Abstract
Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (e.g. BARTScore, T5Score, and GPTScore) demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results…
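The question the abstract poses can be made concrete with a small sketch. This is an illustrative assumption, not the paper's actual protocol: the function name `self_preference_gap`, the nested-dictionary layout, and the toy numbers are all hypothetical. For each evaluator model, it compares the score the metric gives to that model's own outputs against the mean score it gives to outputs from other generators.

```python
from statistics import mean


def self_preference_gap(scores):
    """Compute a simple self-preference gap per evaluator.

    `scores[evaluator][generator]` is assumed to hold the average metric
    score (higher = better) that an evaluator-backed metric assigns to
    summaries produced by `generator`. This layout is hypothetical.
    """
    gaps = {}
    for evaluator, row in scores.items():
        # Skip evaluators that never scored their own outputs.
        if evaluator not in row:
            continue
        others = [s for gen, s in row.items() if gen != evaluator]
        # A gap is undefined without at least one other generator.
        if not others:
            continue
        gaps[evaluator] = row[evaluator] - mean(others)
    return gaps


# Toy numbers invented for illustration only; NOT results from the paper.
toy_scores = {
    "model_a": {"model_a": 0.75, "model_b": 0.5, "model_c": 0.25},
    "model_b": {"model_a": 0.5, "model_b": 0.5, "model_c": 0.5},
}

gaps = self_preference_gap(toy_scores)
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:+.3f}")
```

A positive gap on its own does not establish bias, since a model's outputs may genuinely be better. It would need to be compared against an independent quality signal, such as human judgments, before drawing that conclusion.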
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
