The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Maria Paz Oliva, Adriana Correia, Ivan Vankov, Viktor Botev

TL;DR
Evaluating AI-generated text is complex and no single metric reliably captures quality across tasks, highlighting the need for task-specific evaluation strategies and improved validation methods.
Contribution
This paper critically examines existing evaluation metrics for natural language generation (NLG), showing their limitations and arguing for task-specific metric selection and better validation practices.
Findings
Metrics often capture only specific aspects of text quality.
Effectiveness of metrics varies by task and dataset.
Validation practices for metrics are often unstructured.
Abstract
Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEMs), designed to compare the model output with human-written references and generate a score that approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, so studies use different metrics without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and…
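To make the "simple lexical comparisons" mentioned in the abstract concrete, the sketch below scores a candidate against a human-written reference by unigram overlap (an F1 over shared tokens). It is an illustrative toy, not one of the metrics examined in the paper: the function name `unigram_f1`, the whitespace tokenization, and the example sentences are all assumptions made for this example.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy lexical metric: F1 over unigrams shared by candidate and reference.

    Hypothetical helper for illustration only. Tokenization is a plain
    lowercase whitespace split, which real metrics typically refine.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences (not taken from the paper).
reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"

verbatim_score = unigram_f1(verbatim, reference)
paraphrase_score = unigram_f1(paraphrase, reference)
print(f"verbatim:   {verbatim_score:.3f}")
print(f"paraphrase: {paraphrase_score:.3f}")
```

Under these assumptions, the verbatim copy gets a perfect score while the paraphrase, which a human reader would likely judge as close in meaning, shares only one token with the reference and scores far lower. This is one way a purely lexical metric can capture only a specific aspect of text quality.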
