Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation

Cyril Chhun; Fabian M. Suchanek; Chloé Clavel

arXiv:2405.13769 · cs.CL · May 24, 2024

TL;DR

This paper investigates whether large language models can effectively evaluate stories, comparing their ratings to human judgments and automatic measures, and analyzing the impact of prompting and explainability.

Contribution

It provides an extensive analysis of LLMs as automatic story evaluators, highlighting their strengths and limitations compared to human annotations and existing automatic measures.

Findings

01. LLMs outperform current automatic measures at system-level evaluation

02. LLMs struggle to provide satisfactory explanations for their ratings

03. Prompting influences LLM evaluation results

Abstract

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks that require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLMs) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations…
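The distinction between story-level and system-level correlation that the abstract relies on can be illustrated with a small sketch. Everything below is an assumption made for illustration: the ratings are toy values, the system names (`sys_a`, `sys_b`, `sys_c`) are hypothetical, and Kendall's tau-a is used as a simple rank correlation, which is not necessarily the coefficient reported in the paper.

```python
from itertools import combinations
from statistics import mean


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs count as zero."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1  # both raters order this pair the same way
        elif product < 0:
            score -= 1  # the raters disagree on the order
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return score / n_pairs


def system_means(ratings):
    """Average the per-story ratings of each generation system."""
    return {system: mean(values) for system, values in ratings.items()}


# Toy ratings (hypothetical): three systems, three stories each, 1-5 scale.
human = {"sys_a": [4, 5, 4], "sys_b": [2, 3, 2], "sys_c": [3, 3, 4]}
llm = {"sys_a": [5, 4, 5], "sys_b": [3, 2, 2], "sys_c": [3, 4, 3]}

systems = sorted(human)

# System level: correlate one averaged score per system.
human_sys, llm_sys = system_means(human), system_means(llm)
system_tau = kendall_tau(
    [human_sys[s] for s in systems], [llm_sys[s] for s in systems]
)

# Story level: correlate the individual story ratings directly.
story_tau = kendall_tau(
    [r for s in systems for r in human[s]],
    [r for s in systems for r in llm[s]],
)

print(f"system-level tau: {system_tau:.2f}")
print(f"story-level tau:  {story_tau:.2f}")
```

In this toy data the two raters rank the systems identically even though they disagree on individual stories, so averaging per system hides story-level noise. This is one reason a measure can look stronger at system level than at story level.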

Code & Models

Repositories

dig-team/hanna-benchmark-asg
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques