STORYSUMM: Evaluating Faithfulness in Story Summarization

Melanie Subbiah; Faisal Ladhak; Akankshya Mishra; Griffin Adams; Lydia B. Chilton; Kathleen McKeown

arXiv:2407.06501 · cs.AI · April 2, 2025

TL;DR

This paper introduces STORYSUMM, a dataset for evaluating faithfulness in story summarization. It shows that both individual human annotation protocols and current automatic metrics miss inconsistencies, and it argues for combining a range of evaluation methods.

Contribution

The paper presents STORYSUMM, a dataset of LLM summaries of short stories with localized faithfulness labels and error explanations, and shows that existing automatic metrics do not reliably detect the inconsistencies it contains.

Findings

01

Any single human annotation protocol is likely to miss inconsistencies, even when multiple annotators agree that a summary is faithful.

02

None of the recent automatic metrics tested achieves more than 70% balanced accuracy on the dataset.

03

A range of evaluation methods is needed to establish reliable ground truth for a summarization dataset.

Abstract

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that…
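The abstract reports metric performance as balanced accuracy. As a minimal sketch of that quantity (the mean of per-class recall), assuming binary faithful/unfaithful labels: the function name and the toy data below are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def balanced_accuracy(labels, predictions):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    total = defaultdict(int)    # number of gold examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for gold_label, predicted in zip(labels, predictions):
        total[gold_label] += 1
        if gold_label == predicted:
            correct[gold_label] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy data, NOT from STORYSUMM: 1 = faithful, 0 = unfaithful.
gold = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_faithful = [1] * len(gold)  # a "metric" that never flags an error

plain_accuracy = sum(g == p for g, p in zip(gold, always_faithful)) / len(gold)
print(plain_accuracy)                            # 0.8
print(balanced_accuracy(gold, always_faithful))  # 0.5
```

On this toy data, a detector that never flags an error scores 0.8 plain accuracy but only 0.5 balanced accuracy, which is why balanced accuracy is a common choice when unfaithful summaries are the minority class.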

Code & Models

Repositories

melaniesubbiah/storysumm
Official

Videos

STORYSUMM: Evaluating Faithfulness in Story Summarization · underline

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques