What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation
Dingyi Yang, Qin Jin

TL;DR
This paper systematically investigates how to evaluate book-length stories, introduces a large benchmark dataset, compares evaluation methods, and proposes a new model that outperforms existing commercial models in aligning with human judgments.
Contribution
It introduces the first large-scale benchmark for long story evaluation, analyzes key evaluation aspects, compares different evaluation methods, and proposes a novel model that improves alignment with human evaluations.
Findings
Of the three method types compared, aggregation- and summary-based evaluations perform better than incremental-updated evaluation.
Summary-based evaluation offers greater efficiency.
The proposed NovelCritique model outperforms GPT-4o in alignment with human judgments.
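"Alignment with human judgments" is typically quantified with a rank correlation between an evaluator's scores and the average reader ratings. The summary here does not state which metric the paper reports, so the sketch below uses Kendall's tau purely as an assumed illustration; the function name and the ratings are made up for the example.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    Assumed metric for illustration; not confirmed by the source.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: average reader ratings for four books,
# plus scores from two imaginary automatic evaluators.
human_ratings = [4.2, 3.1, 3.8, 2.5]
evaluator_a = [4.0, 3.0, 3.5, 2.0]  # ranks the books exactly as readers do
evaluator_b = [3.0, 3.5, 2.0, 4.0]  # mostly reverses the reader ranking

tau_a = kendall_tau(human_ratings, evaluator_a)
tau_b = kendall_tau(human_ratings, evaluator_b)
```

Under this reading, an evaluator "outperforms" another when its tau against the human ratings is higher, as `tau_a` is relative to `tau_b` here.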
Abstract
In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based…
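The abstract names three method types but, in the portion shown here, does not define them. The sketch below is one plausible reading inferred only from the names: aggregation-based scores each chunk and averages, incremental-updated carries a running score through the book, and summary-based compresses the book before scoring once. All function names, the chunk size, the update weights, and the toy scorer and summarizer are hypothetical stand-ins for whatever evaluator (for example, an LLM) would be used in practice.

```python
from statistics import mean
from typing import Callable, List, Optional

Scorer = Callable[[str], float]
Summarizer = Callable[[str], str]
Updater = Callable[[Optional[float], str], float]


def split_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def aggregation_based(text: str, scorer: Scorer, size: int) -> float:
    """Assumed reading: score every chunk independently, then average."""
    return mean(scorer(chunk) for chunk in split_chunks(text, size))


def incremental_updated(text: str, update: Updater, size: int) -> float:
    """Assumed reading: revise one running score chunk by chunk."""
    score: Optional[float] = None
    for chunk in split_chunks(text, size):
        score = update(score, chunk)
    if score is None:
        raise ValueError("empty text")
    return score


def summary_based(text: str, summarizer: Summarizer, scorer: Scorer, size: int) -> float:
    """Assumed reading: summarize each chunk, then score the joined summary once."""
    summary = " ".join(summarizer(chunk) for chunk in split_chunks(text, size))
    return scorer(summary)


# Toy stand-ins (hypothetical): count the word "twist" as a proxy for quality.
def toy_scorer(text: str) -> float:
    return min(5.0, 1.0 + text.split().count("twist"))


def toy_summarizer(chunk: str) -> str:
    return " ".join(chunk.split()[-2:])  # keep the last two words


def toy_update(prev: Optional[float], chunk: str) -> float:
    current = toy_scorer(chunk)
    return current if prev is None else 0.3 * prev + 0.7 * current


story = "hero leaves home twist hero returns home twist twist end"
agg = aggregation_based(story, toy_scorer, size=5)
inc = incremental_updated(story, toy_update, size=5)
summ = summary_based(story, toy_summarizer, toy_scorer, size=5)
```

Note the cost difference this sketch makes visible: the aggregation and incremental variants call the scorer once per chunk, while the summary variant calls it once on a much shorter text, which is consistent with the efficiency finding listed above.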
Taxonomy
Topics: Artificial Intelligence in Games · Digital Humanities and Scholarship · Topic Modeling
