LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein; Sebastian Russo; Violet Xiang; Kabir Jolly; Rafael Rafailov; Nick Haber

arXiv:2507.00769 · cs.CL · July 2, 2025

TL;DR

LitBench is a new benchmark and dataset designed to reliably evaluate creative writing generated by large language models, addressing the challenge of assessing open-ended narratives without ground truth.

Contribution

The paper introduces LitBench, the first standardized benchmark and paired dataset for creative writing verification. It includes human-labeled story comparisons and trained reward models that outperform off-the-shelf LLM judges.

Findings

01

Claude-3.7-Sonnet is the strongest off-the-shelf judge, reaching 73% agreement with human preference labels.

02

Trained reward models achieve 78% accuracy, surpassing off-the-shelf judges.

03

In an online human study, reward model rankings align well with human preferences on newly LLM-generated stories.
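The agreement and accuracy figures above are rates over pairwise comparisons: for each pair of stories, the judge's pick is checked against the human label. A minimal sketch of that computation follows. The function name `pairwise_agreement` and the toy `"A"`/`"B"` labels are illustrative assumptions, not part of LitBench.

```python
from typing import Sequence


def pairwise_agreement(judge_picks: Sequence[str], human_picks: Sequence[str]) -> float:
    """Fraction of story pairs where the judge prefers the same story as the human label."""
    if len(judge_picks) != len(human_picks):
        raise ValueError("judge and human label lists must have the same length")
    if len(human_picks) == 0:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_picks, human_picks))
    return matches / len(human_picks)


# Toy data (hypothetical): which story in each pair was preferred.
human = ["A", "B", "A", "A"]
judge = ["A", "B", "B", "A"]

agreement = pairwise_agreement(judge, human)
print(f"agreement = {agreement:.2f}")  # 3 of 4 pairs match
```

On a balanced pairwise test set, a judge that guesses at random would score about 50% on this metric, which is the natural baseline for reading the figures above.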

Abstract

Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as…
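For readers unfamiliar with Bradley-Terry reward models, the sketch below shows the standard pairwise objective on scalar reward scores: the probability that the preferred story wins is the sigmoid of the reward difference, and the loss is its negative log. This is the textbook formulation, not the authors' training code. The function names and the example reward values are assumptions made for illustration.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen story beats the rejected one:
    sigmoid(r_chosen - r_rejected), computed in a numerically stable way."""
    d = r_chosen - r_rejected
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human preference, -log(sigmoid(d)),
    written as a stable softplus of -d."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


# Hypothetical reward scores for one human-labeled pair.
p = bt_preference_prob(1.2, 0.3)
loss = bt_loss(1.2, 0.3)
print(f"P(chosen preferred) = {p:.3f}, loss = {loss:.3f}")
```

A reward model trained with this loss is counted as correct on a test pair when it assigns the higher score to the human-preferred story, which is how a pairwise accuracy figure can be read off scalar rewards.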

Code & Models

Models

ConicCat/Litbench-Creative-Writing-RM-3B
model · 471 dl · ♡ 2

Videos

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing · underline

Taxonomy

Topics: Artificial Intelligence in Games · Artificial Intelligence in Healthcare and Education · Mental Health via Writing