Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors
Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F. Rousseau, Greg Durrett

TL;DR
This paper evaluates the performance of various factuality metrics, including ChatGPT-based ones, across different summarization models and error types, revealing significant variability and the need for nuanced evaluation practices.
Contribution
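The paper aggregates factuality error annotations from nine existing datasets, stratifies them by the underlying summarization model, compares state-of-the-art factuality metrics on this stratified benchmark, and offers insights into where each metric is effective along with recommended evaluation practices.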
Findings
Performance of factuality metrics varies across models and error types.
Recent improvements in factuality metrics show up mostly on summaries from older models, not on summaries from recent ones.
No single metric outperforms others in all settings.
Abstract
The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality…
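The stratified comparison described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not taken from the paper's code: the record fields (`stratum`, `label`, `score`), the stratum names, the 0.5 threshold, the toy data, and the choice of balanced accuracy as the per-stratum score.

```python
from collections import defaultdict


def balanced_accuracy(labels, preds):
    """Mean of per-class recall over the classes present (0 = non-factual, 1 = factual)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if not idx:
            continue  # skip a class that does not occur in this stratum
        recalls.append(sum(preds[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def stratified_scores(records, threshold=0.5):
    """Group records by summarizer stratum and score the metric within each group.

    `threshold` is a hypothetical cut-off that turns a metric's continuous
    score into a binary factual / non-factual prediction.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["stratum"]].append(r)
    result = {}
    for stratum, rows in groups.items():
        labels = [r["label"] for r in rows]
        preds = [int(r["score"] >= threshold) for r in rows]
        result[stratum] = balanced_accuracy(labels, preds)
    return result


# Toy, made-up records: one metric's scores on summaries from two strata.
records = [
    {"stratum": "older", "label": 1, "score": 0.9},
    {"stratum": "older", "label": 0, "score": 0.2},
    {"stratum": "older", "label": 0, "score": 0.1},
    {"stratum": "older", "label": 1, "score": 0.7},
    {"stratum": "recent", "label": 1, "score": 0.8},
    {"stratum": "recent", "label": 0, "score": 0.6},
    {"stratum": "recent", "label": 0, "score": 0.3},
    {"stratum": "recent", "label": 1, "score": 0.4},
]

scores = stratified_scores(records)
print(scores)
```

Reporting one number per stratum, rather than a single pooled number, is what makes a gap between older and more recent summarizers visible. In this toy data the metric separates the "older" summaries perfectly but is at chance on the "recent" ones.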
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Balanced Selection
