Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Liyan Tang; Tanya Goyal; Alexander R. Fabbri; Philippe Laban; Jiacheng Xu; Semih Yavuz; Wojciech Kryściński; Justin F. Rousseau; Greg Durrett

arXiv:2205.12854·cs.CL·May 29, 2023·5 cites

TL;DR

This paper evaluates the performance of various factuality metrics, including ChatGPT-based ones, across different summarization models and error types, revealing significant variability and the need for nuanced evaluation practices.

Contribution

It aggregates and stratifies factuality error annotations from multiple datasets, compares state-of-the-art metrics across models, and provides insights into their varying effectiveness and best practices.
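The stratification idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the records, the dataset names, the two strata (`older`, `recent`), the fixed threshold, and the choice of balanced accuracy as the score are hypothetical and are not taken from this page or from the paper's released code.

```python
from collections import defaultdict

# Hypothetical records: (source dataset, summarizer stratum, human label, metric score).
# Label 1 = factually consistent, 0 = contains an error. All values are invented.
RECORDS = [
    ("dataset_a", "older", 1, 0.91),
    ("dataset_a", "older", 0, 0.12),
    ("dataset_b", "older", 0, 0.35),
    ("dataset_b", "older", 1, 0.80),
    ("dataset_a", "recent", 1, 0.77),
    ("dataset_b", "recent", 0, 0.66),
    ("dataset_b", "recent", 0, 0.40),
    ("dataset_b", "recent", 1, 0.45),
]


def balanced_accuracy(labels, preds):
    """Mean of per-class recall, so a skewed label mix does not dominate."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    recalls = []
    if pos:
        recalls.append(tp / pos)
    if neg:
        recalls.append(tn / neg)
    return sum(recalls) / len(recalls) if recalls else 0.0


def stratified_scores(records, threshold=0.5):
    """Pool annotations from all datasets, then score the metric per stratum."""
    groups = defaultdict(list)
    for _dataset, stratum, label, score in records:
        groups[stratum].append((label, score))

    results = {}
    for stratum, pairs in groups.items():
        labels = [label for label, _ in pairs]
        # Binarize the metric score with an assumed fixed threshold.
        preds = [1 if score >= threshold else 0 for _, score in pairs]
        results[stratum] = balanced_accuracy(labels, preds)
    return results


if __name__ == "__main__":
    for stratum, value in sorted(stratified_scores(RECORDS).items()):
        print(f"{stratum}: {value:.2f}")
```

Reporting one number per stratum, rather than a single pooled score, is what would let a gap between older and more recent summarizers show up in a comparison of this kind.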

Findings

01 The performance of factuality metrics varies across summarization models and error types.

02 Recent improvements in factuality metrics are mostly on summaries from older models, not recent ones.

03 No single metric outperforms the others in all settings.

Abstract

The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality…

Code & Models

Repositories

liyan06/aggrefact
Official

Models

🤗 vectara/hallucination_evaluation_model
model · 72k dl · ♡ 348

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: Balanced Selection