SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits

Onkar Thorat; Philippe Laban; Chien-Sheng Wu

arXiv:2412.13378·cs.CL·June 3, 2025



TL;DR

SummExecEdit is a new benchmark that uses executable edits to evaluate how well models detect and explain factual inconsistencies in summaries. It shows that current models still struggle with the task and identifies their most common explanation errors.

Contribution

The paper presents SummExecEdit, a novel pipeline and benchmark that uses executable edits to assess how well models detect and explain factual errors in summaries, aiming to be more challenging and more interpretable than existing benchmarks.

Findings

01 The top model, Claude3-Opus, scores only 0.49 on joint detection and explanation

02 More than half of the 20+ LLMs studied struggle with over 30% of the benchmark

03 45.4% of explanation errors focus on completely unrelated parts of the summary

Abstract

Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel pipeline and benchmark leveraging executable edits to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. We conduct detailed evaluations to assess the current state of models in this field and find that more than half of the 20+ LLMs in our study struggle with over 30% of the SummExecEdit benchmark. Additionally, we identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
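The three scores quoted above can be related with a small sketch. It assumes, and the abstract does not confirm, that explanation quality is measured only on samples where the error was detected, and that the joint score is the fraction of samples that are both detected and correctly explained. Under that assumption the joint score equals detection times explanation, which matches the reported figures (0.67 × 0.73 ≈ 0.49). The `Judgment` record, the `score` function, and the toy counts are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One hypothetical per-sample outcome (not the paper's data format)."""
    detected: bool             # model flagged the inconsistent summary
    explanation_correct: bool  # explanation judged correct; only meaningful if detected


def score(judgments):
    """Return (detection, explanation, joint) under the assumed definitions."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to score")
    hits = [j for j in judgments if j.detected]
    detection = len(hits) / n
    # Assumption: explanation accuracy is conditional on a correct detection.
    explanation = sum(j.explanation_correct for j in hits) / len(hits) if hits else 0.0
    joint = sum(j.detected and j.explanation_correct for j in judgments) / n
    return detection, explanation, joint


# Toy data chosen to mirror the reported scores: 100 samples, 67 detected,
# 49 of those also explained correctly. These counts are illustrative only.
toy = (
    [Judgment(True, True)] * 49
    + [Judgment(True, False)] * 18
    + [Judgment(False, False)] * 33
)
detection, explanation, joint = score(toy)
print(f"detection={detection:.2f} explanation={explanation:.2f} joint={joint:.2f}")
```

With a conditional explanation score, the joint score is always the product of the other two, so a model must do well on both detection and explanation to score well jointly.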


Code & Models

Datasets

Salesforce/summexecedit
dataset · 7 dl


Taxonomy

Topics: Topic Modeling · Data Quality and Management · Semantic Web and Ontologies