Stress Testing Factual Consistency Metrics for Long-Document Summarization

Zain Muhammad Mujahid; Dustin Wright; Isabelle Augenstein

arXiv:2511.07689 · cs.CL · April 30, 2026

TL;DR

This paper systematically evaluates the reliability of six reference-free factuality metrics for long-document summarization, revealing their limitations and proposing directions for improvement.

Contribution

The paper analyzes how robust existing reference-free factuality metrics are in long-document summarization and offers concrete suggestions for improving factuality evaluation.

Findings

01

Existing metrics assign inconsistent scores to semantically equivalent summaries.

02

Metric reliability declines on information-dense claims.

03

Expanding retrieval context improves stability in some cases.

Abstract

Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing…
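The perturbation protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_support_score` is a hypothetical token-overlap stand-in rather than one of the six metrics studied, and the synonym map and example texts are invented for illustration. The sketch applies one factuality-preserving perturbation (synonym replacement) to a summary and reports how far the score moves, even though the meaning is unchanged.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_support_score(source, summary):
    """Hypothetical stand-in metric: fraction of summary tokens found in the source."""
    source_vocab = set(tokens(source))
    summary_tokens = tokens(summary)
    if not summary_tokens:
        return 0.0
    return sum(t in source_vocab for t in summary_tokens) / len(summary_tokens)


def synonym_replace(summary, synonyms):
    """Swap whole words using a hand-written synonym map (an assumption of this sketch)."""
    def swap(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)
    return re.sub(r"[A-Za-z0-9]+", swap, summary)


def score_shift(source, summary, perturb):
    """Score the original and perturbed summary and return both plus the difference."""
    original = toy_support_score(source, summary)
    perturbed = toy_support_score(source, perturb(summary))
    return original, perturbed, perturbed - original


# Invented example texts; the perturbed summary keeps the same meaning.
source = "The court rejected the appeal because the evidence was insufficient."
summary = "The court rejected the appeal."
synonyms = {"rejected": "dismissed"}

original, perturbed, shift = score_shift(
    source, summary, lambda s: synonym_replace(s, synonyms)
)
print(f"original={original:.2f} perturbed={perturbed:.2f} shift={shift:+.2f}")
```

A robust metric would keep the shift near zero under such a perturbation. The toy overlap score drops because it depends on surface wording, which is the kind of instability the stress test is designed to expose.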

Code & Models

Repositories

zainmujahid/metricEval-longSum (GitHub)
