TL;DR
This paper introduces STORIUM, a large, richly annotated storytelling dataset and an evaluation platform that enables collaborative, human-in-the-loop assessment of story generation models, addressing current limitations in data quality and evaluation reliability.
Contribution
It provides a novel dataset with detailed annotations and an interactive platform for evaluating story generation models through human collaboration.
Findings
Automatic metrics computed over authors' edits to model suggestions correlate with user ratings
Models fine-tuned on STORIUM produce more plausible stories
The platform facilitates iterative story refinement
Abstract
Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues, we introduce a dataset and evaluation platform built from STORIUM, an online collaborative storytelling community. Our author-generated dataset contains 6K lengthy stories (125M tokens) with fine-grained natural language annotations (e.g., character goals and attributes) interspersed throughout each narrative, forming a robust source for guiding models. We evaluate language…
