LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber

TL;DR
LitBench is a new benchmark and dataset designed to reliably evaluate creative writing generated by large language models, addressing the challenge of assessing open-ended narratives without ground truth.
Contribution
The paper introduces LitBench, the first standardized benchmark and dataset for creative writing verification. It includes human-labeled story comparisons and trained reward models that outperform off-the-shelf judges.
Findings
Claude-3.7-Sonnet is the best off-the-shelf judge, with 73% agreement with human preference labels.
Trained reward models achieve 78% accuracy, surpassing off-the-shelf judges.
In an online human study, reward model rankings align with human preferences on newly LLM-generated stories.
Abstract
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as…
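To make the evaluation setup concrete, here is a minimal sketch of how a Bradley-Terry reward model scores a story pair and how pairwise agreement with human labels can be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names and the scalar scores are hypothetical, and a real reward model would produce the scores from story text.

```python
import math


def bt_preference_prob(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the first story is preferred.

    Scores are hypothetical scalar rewards, one per story.
    """
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def bt_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human label under Bradley-Terry."""
    # log1p(exp(-d)) equals -log(sigmoid(d)) for d = chosen - rejected.
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


def pairwise_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred story scores higher.

    Each tuple is (score of human-preferred story, score of the other story).
    """
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


# Hypothetical scores for four labeled story pairs.
example_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
accuracy = pairwise_accuracy(example_pairs)
```

Under this framing, the agreement and accuracy figures reported for judges and reward models correspond to the fraction of held-out comparisons where the model's preferred story matches the human-preferred one.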
Taxonomy
Topics: Artificial Intelligence in Games · Artificial Intelligence in Healthcare and Education · Mental Health via Writing
