TL;DR
This paper systematically evaluates the reliability of six reference-free factuality metrics for long-document summarization, revealing their limitations and proposing directions for improvement.
Contribution
It provides a comprehensive analysis of existing metrics' robustness in long-form summarization and offers concrete suggestions for enhancing factuality evaluation methods.
Findings
Existing metrics assign inconsistent scores to semantically equivalent summaries.
Metric reliability declines on information-dense claims.
Expanding retrieval context improves stability in some cases.
Abstract
Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing…
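The robustness probe described in the abstract can be illustrated with a minimal sketch: score a summary, apply a factuality-preserving perturbation, score it again, and measure how far the score moves. Everything named here is an assumption for illustration only. `toy_overlap_metric` is a hypothetical stand-in and not one of the six metrics the paper evaluates, the `SYNONYMS` table is invented, and `mean_abs_score_shift` is just one plausible way to quantify inconsistency, not necessarily the paper's measure.

```python
from statistics import mean

def toy_overlap_metric(source: str, summary: str) -> float:
    """Hypothetical stand-in metric: fraction of summary tokens found in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(tok in source_tokens for tok in summary_tokens) / len(summary_tokens)

# Invented synonym table; a real probe would use a proper lexical resource.
SYNONYMS = {"big": "large", "quick": "fast"}

def synonym_replace(summary: str) -> str:
    """Factuality-preserving perturbation: swap words for listed synonyms."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in summary.split())

def mean_abs_score_shift(metric, perturb, pairs):
    """Average absolute change in the metric score after perturbing each summary."""
    shifts = [
        abs(metric(src, summ) - metric(src, perturb(summ)))
        for src, summ in pairs
    ]
    return mean(shifts)

# Toy (source, summary) pairs, made up for this sketch.
pairs = [
    ("the big dog ran home", "the big dog ran"),
    ("a quick fox jumped over the fence", "a quick fox jumped"),
]

shift = mean_abs_score_shift(toy_overlap_metric, synonym_replace, pairs)
print(f"mean absolute score shift: {shift:.2f}")
```

A perfectly robust metric would give a shift of zero here, since the perturbed summaries state the same facts. The toy lexical-overlap metric does not, which is the kind of inconsistency the Findings describe.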
