Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Melanie Subbiah; Sean Zhang; Lydia B. Chilton; Kathleen McKeown

arXiv:2403.01061·cs.CL·July 15, 2024·1 cites

TL;DR

This paper assesses the ability of large language models to accurately summarize complex short stories with subtext, revealing significant faithfulness issues and poor correlation between automatic metrics and author evaluations.

Contribution

It provides a novel evaluation framework involving authors' judgments and narrative theory to analyze LLM summarization of unseen, nuanced stories.

Findings

01 All three models (GPT-4, Claude-2.1, and Llama-2-70B) made faithfulness errors in over 50% of summaries.

02 Models struggle with specificity and with interpreting difficult subtext.

03 LLM ratings and other automatic metrics correlate poorly with the writers' quality ratings.

Abstract

We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and can include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
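The abstract's last claim, that automatic metrics do not track writer ratings, is the kind of statement usually checked with a rank correlation. The sketch below shows one way such a check could look, assuming Spearman's rho as the statistic; this page does not say which correlation measure the paper uses. The function names and the two score lists are hypothetical placeholders for illustration, not data or code from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder values, NOT results from the paper:
# one writer rating and one automatic metric score per summary.
writer_ratings = [4, 2, 3, 1, 4]
metric_scores = [0.61, 0.58, 0.40, 0.55, 0.47]

rho = spearman(writer_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near zero would indicate that the metric's ordering of summaries says little about how the writers ordered them, while values near 1 or -1 would indicate strong agreement or disagreement. With only a handful of summaries, any such estimate is noisy and should be read with caution.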

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Dropout · Softmax · Dense Connections · Label Smoothing · Adam