Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling
Peiqi Sui, Yutong Zhu, Tianyi Cheng, Peter West, Richard Jean So, Hoyt Long, Ari Holtzman

TL;DR
This paper introduces the 100-Endings metric for measuring narrative tension in LLM-generated stories, a dimension that existing rubric-based creative-writing benchmarks overlook.
Contribution
The paper proposes a tension metric based on how often sentence-by-sentence predictions of a story's ending fail, together with a story-generation pipeline, grounded in narratology, that increases narrative tension.
Findings
100-Endings ranks human stories above LLM outputs.
The pipeline increases narrative tension without sacrificing overall story quality.
The metric captures story twists and revelations effectively.
Abstract
LLMs have so far failed both to generate consistently compelling stories and to recognize this failure: on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model makes 100 predictions of how the story will end, given only the text so far, and we measure tension as how often those predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks…
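The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `predict_ending` (the model call) and `matches` (the comparison against the true ending) are hypothetical callables supplied by the caller, the loop over every sentence prefix is an assumption, and the normalization used in `inflection_rate` is one plausible reading of "how frequently the curve reverses direction" rather than the paper's exact definition.

```python
from typing import Callable, List, Sequence


def mismatch_curve(
    sentences: Sequence[str],
    true_ending: str,
    predict_ending: Callable[[str], str],   # hypothetical: samples one predicted ending
    matches: Callable[[str, str], bool],    # hypothetical: does a prediction match the truth?
    n_samples: int = 100,
) -> List[float]:
    """Return the per-sentence mismatch rate: the fraction of sampled
    ending predictions that fail to match the true ending."""
    curve = []
    for i in range(1, len(sentences) + 1):
        prefix = " ".join(sentences[:i])  # only the text seen so far
        misses = sum(
            not matches(predict_ending(prefix), true_ending)
            for _ in range(n_samples)
        )
        curve.append(misses / n_samples)
    return curve


def inflection_rate(curve: Sequence[float]) -> float:
    """Fraction of consecutive non-flat steps at which the curve changes
    direction (rising to falling or the reverse). Assumed normalization."""
    diffs = [b - a for a, b in zip(curve, curve[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]  # drop flat steps
    if len(signs) < 2:
        return 0.0
    flips = sum(s != t for s, t in zip(signs, signs[1:]))
    return flips / (len(signs) - 1)
```

In practice `predict_ending` would wrap a sampled LLM completion and `matches` would be some judgment of whether two endings agree; both are left abstract here because the summary does not specify them. A story whose curve stays high until late would read, under this sketch, as hard to forecast, while frequent direction reversals would correspond to twists and revelations.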
