How Far are We from Robust Long Abstractive Summarization?
Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan

TL;DR
This paper evaluates the current state of long document abstractive summarization, highlighting the gap between relevance and factual accuracy, and proposes directions for developing better factuality metrics.
Contribution
It provides a detailed human-annotated dataset and analysis of models and metrics, revealing limitations of ROUGE and factuality measures, and suggests future research directions.
Findings
ROUGE effectively measures relevance but not factuality.
Current factuality metrics have significant limitations.
BARTScore shows promising results in factuality evaluation.
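The first finding can be illustrated with a minimal sketch, assuming a simplified ROUGE-1 (lowercased whitespace tokens, no stemming) rather than the official scorer or the paper's evaluation setup. The function name `rouge1_f1` and the example sentences are hypothetical and chosen only for illustration. The sketch shows that a summary which reverses who did what can still reach a perfect unigram-overlap score.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences (not taken from the paper's dataset).
reference = "alice paid bob"
faithful = "alice paid bob"
unfaithful = "bob paid alice"  # same words, reversed roles

faithful_score = rouge1_f1(faithful, reference)
unfaithful_score = rouge1_f1(unfaithful, reference)
print(faithful_score, unfaithful_score)
```

Both candidates receive the same score because unigram overlap ignores word order and the relations between entities. Higher-order variants such as ROUGE-2 would penalize this particular swap, but they still compare surface n-grams rather than checking whether a claim is supported by the source.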
Abstract
Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries. For long document abstractive models, we show that the constant striving for state-of-the-art ROUGE results can lead us to generate more relevant summaries but not factual ones. For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary. They also reveal important limitations of factuality metrics in detecting different types of factual errors and the reasons behind the effectiveness of BARTScore. We then suggest promising directions in the endeavor of developing factual consistency metrics. Finally, we release our annotated long…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
