Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Sanjana Ramprasad, Byron C. Wallace

TL;DR
This paper stress-tests automatic factuality metrics for summarization. It finds that they struggle to detect deep factual inconsistencies and are vulnerable to manipulation, which calls their reliability into question.
Contribution
The study systematically tests a range of factuality metrics, including specialized models and LLM-based prompting methods, exposes their weaknesses and susceptibility to gaming, and highlights the need for more robust evaluation methods.
Findings
Metrics perform poorly on complex cases that require reasoning.
Some metrics are sensitive to benign edits rather than to factual errors.
Prompt-based LLM assessments can be manipulated by content-free additions.
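The gaming finding can be pictured with a small probe: score a summary, append a sentence that adds no new claim, and score it again. The sketch below is illustrative only. It is not the paper's protocol, and `token_overlap_score` is a hypothetical toy stand-in rather than any of the metrics the paper evaluates; `score_shift` and the example texts are likewise assumptions made for this sketch.

```python
import re
from typing import Callable

# A metric maps (source, summary) to a score; higher means "more consistent".
Metric = Callable[[str, str], float]


def token_overlap_score(source: str, summary: str) -> float:
    """Toy stand-in metric (hypothetical): fraction of summary tokens found in the source."""
    source_tokens = set(re.findall(r"\w+", source.lower()))
    summary_tokens = re.findall(r"\w+", summary.lower())
    if not summary_tokens:
        return 0.0
    hits = sum(1 for token in summary_tokens if token in source_tokens)
    return hits / len(summary_tokens)


def score_shift(metric: Metric, source: str, summary: str, suffix: str) -> float:
    """Change in the metric's score after appending `suffix` to the summary."""
    return metric(source, summary + " " + suffix) - metric(source, summary)


source = "The city council approved the budget on Monday."
# The summary contradicts the source ("rejected" vs. "approved").
summary = "The council rejected the budget outright."
# A tautology that asserts nothing new about the source.
filler = "The budget is the budget."

shift = score_shift(token_overlap_score, source, summary, filler)
print(f"score shift after appending filler: {shift:+.3f}")
```

With this toy metric the score rises even though the factual error is untouched, because the filler reuses source vocabulary. Any real metric can be passed in place of `token_overlap_score` as long as it accepts a source and a summary and returns a float.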
Abstract
Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate "easy" examples for…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Law
