Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Ameya Godbole; Robin Jia

arXiv:2501.14883·cs.CL·January 31, 2025

TL;DR

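This paper re-evaluates five state-of-the-art factuality metrics for natural language generation on 11 datasets spanning summarization, retrieval-augmented generation, and question answering. It finds that the metrics disagree with each other, misestimate system-level performance, and carry systematic biases, and it urges users to validate them manually before relying on them.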

Contribution

It provides a comprehensive re-evaluation of current factuality metrics, highlighting their limitations and biases, and offers guidance for their cautious application in NLP evaluation.

Findings

01

Metrics are inconsistent with each other.

02

Metrics often misestimate system performance.

03

Metrics are biased against highly paraphrased outputs and outputs that draw on distant parts of the source documents.

Abstract

Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.
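The manual validation the abstract recommends can be sketched as two simple checks: how often two metrics agree on the same outputs, and how far a metric's system-level factuality rate drifts from a human-annotated rate. The snippet below is a minimal illustration, not the authors' evaluation protocol. The function names (`cohen_kappa`, `pairwise_kappa`, `system_level_error`), the binary label encoding (1 = factual, 0 = not factual), and the toy labels are all assumptions made for this example.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels (1 = factual)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    # Fraction of examples on which the two raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's positive rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both raters are constant and identical; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(metric_labels):
    """Kappa for every pair of metrics in a {name: labels} dict."""
    return {
        (m1, m2): cohen_kappa(metric_labels[m1], metric_labels[m2])
        for m1, m2 in combinations(sorted(metric_labels), 2)
    }


def system_level_error(metric, human):
    """Metric-estimated factuality rate minus the human-annotated rate."""
    if len(metric) != len(human) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(metric) / len(metric) - sum(human) / len(human)


if __name__ == "__main__":
    # Hypothetical labels for one system's outputs; not data from the paper.
    human = [1, 0, 1, 0, 1, 1, 0, 1]
    metrics = {
        "metric_a": [1, 1, 1, 0, 1, 1, 1, 1],
        "metric_b": [1, 0, 0, 0, 1, 0, 0, 1],
    }
    for pair, kappa in pairwise_kappa(metrics).items():
        print(pair, round(kappa, 3))
    for name, labels in metrics.items():
        print(name, "system-level error:", system_level_error(labels, human))
```

A low kappa between two metrics, or a system-level error that changes sign from one system to another, would be a signal to collect more human annotations in the target domain before trusting either metric's ranking.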

Taxonomy

Topics: Law, Economics, and Judicial Systems · Ethics and Social Impacts of AI