What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang, Qin Jin

TL;DR
This survey reviews the challenges, criteria, datasets, and metrics for evaluating stories, especially AI-generated ones, highlighting the complexity of the task and proposing future research directions.
Contribution
It provides a comprehensive taxonomy and analysis of existing story evaluation metrics, datasets, and human-AI collaboration methods, addressing the unique challenges of story assessment.
Findings
Identified key human criteria for story evaluation.
Categorized existing metrics and datasets for story assessment.
Discussed future directions including human-AI collaboration.
Abstract
With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatically generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, and interestingness. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their…
Taxonomy
Topics: Digital Storytelling and Education
Methods: Focus
