SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits
Onkar Thorat, Philippe Laban, Chien-Sheng Wu

TL;DR
SummExecEdit is a benchmark that uses executable edits to evaluate how well models detect and explain factual inconsistencies in summaries. It shows that current models still struggle with the task and identifies common types of explanation errors.
Contribution
The paper presents SummExecEdit, a novel benchmark built with executable edits for assessing how well models detect and explain factual errors in summaries. It is designed to be more challenging and more interpretable than existing benchmarks.
Findings
Top model (Claude3-Opus) scores only 0.49 on joint detection and explanation
More than half of the 20+ LLMs evaluated struggle with over 30% of the benchmark
45.4% of explanation errors focus on completely unrelated parts of the summary
Abstract
Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel pipeline and benchmark leveraging executable edits to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. We conduct detailed evaluations to assess the current state of models in this field and find that more than half of the 20+ LLMs in our study struggle with over 30% of the SummExecEdit benchmark. Additionally, we identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Semantic Web and Ontologies
