Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics
Ameya Godbole, Robin Jia

TL;DR
This paper critically examines five leading factuality metrics for natural language generation, revealing their inconsistencies, biases, and unreliability across multiple datasets and tasks, urging cautious use and manual validation.
Contribution
It provides a comprehensive re-evaluation of current factuality metrics, highlighting their limitations and biases, and offers guidance for their cautious application in NLP evaluation.
Findings
Metrics are inconsistent with each other.
Metrics often misestimate system performance.
Metrics are biased against highly paraphrased outputs and outputs that draw on faraway parts of the source documents.
Abstract
Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.
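The manual validation the abstract recommends can be sketched in a few lines. The example below is a minimal illustration, not the authors' evaluation protocol. It assumes you have a small sample of outputs with binary human factuality labels (1 = factual, 0 = not factual) and binary judgments from each metric on the same outputs. The function names, the metric names `metric_a` and `metric_b`, the toy labels, and the choice of balanced accuracy and Cohen's kappa as summary statistics are all assumptions made for illustration.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement if the two raters labeled independently.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)


def balanced_accuracy(gold, pred):
    """Mean of recall on factual (1) and non-factual (0) examples.

    Assumes gold contains at least one example of each class.
    """
    pos = [p for g, p in zip(gold, pred) if g == 1]
    neg = [p for g, p in zip(gold, pred) if g == 0]
    tpr = sum(pos) / len(pos)
    tnr = sum(1 - p for p in neg) / len(neg)
    return (tpr + tnr) / 2


def validation_report(human, metric_preds):
    """Compare each metric to human labels and the metrics to each other."""
    human_rate = sum(human) / len(human)
    per_metric = {}
    for name, pred in metric_preds.items():
        per_metric[name] = {
            "balanced_accuracy": balanced_accuracy(human, pred),
            # Positive: the metric overestimates the system's factuality rate.
            "rate_error": sum(pred) / len(pred) - human_rate,
        }
    pairwise_kappa = {
        (m1, m2): cohen_kappa(metric_preds[m1], metric_preds[m2])
        for m1, m2 in combinations(sorted(metric_preds), 2)
    }
    return per_metric, pairwise_kappa


# Toy labels, invented purely for illustration (not data from the paper).
human = [1, 1, 0, 0, 1, 0, 1, 0]
metric_preds = {
    "metric_a": [1, 1, 0, 0, 1, 0, 1, 1],  # hypothetical metric
    "metric_b": [1, 0, 0, 1, 1, 0, 0, 0],  # hypothetical metric
}
per_metric, pairwise_kappa = validation_report(human, metric_preds)
```

A low pairwise kappa indicates that the metrics disagree with each other on this sample, and a large `rate_error` indicates that a metric misestimates the system-level factuality rate relative to the human labels. In practice the sample should come from the domain of interest and be large enough for the estimates to be stable.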
Taxonomy
Topics: Law, Economics, and Judicial Systems · Ethics and Social Impacts of AI
