Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique

Piotr Sawicki; Marek Grześ; Dan Brown; Fabrício Góes

arXiv:2502.19064 · cs.CL · October 7, 2025

TL;DR

This paper adapts the Consensual Assessment Technique (CAT) to Large Language Models and shows that LLMs can outperform non-expert human judges at poetry evaluation while exhibiting high inter-rater reliability.

Contribution

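The study presents a comparative evaluation method for LLM poetry assessment, based on forced-choice ranking within small, randomized batches. On a 90-poem dataset, the method reaches a Spearman's Rank Correlation (SRC) of 0.87 with a ground truth based on publication venue, against 0.38 for the best non-expert human evaluation.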

Findings

01. Claude-3-Opus achieved a Spearman's Rank Correlation (SRC) of 0.87 with the ground truth.

02. LLMs outperformed non-expert human judges, whose best evaluation reached SRC = 0.38.

03. LLM assessments showed high inter-rater reliability.

Abstract

This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman's Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology's robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative…
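
The evaluation loop described in the abstract (forced-choice ranking within small, randomized batches, then Spearman's Rank Correlation against the ground truth) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the batch size, the number of rounds, aggregation by mean within-batch position, the three toy quality tiers, and the `toy_judge` function standing in for an LLM call are all hypothetical choices made here for the example.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def batched_scores(poem_ids, rank_batch, batch_size=5, rounds=20, seed=0):
    """Mean within-batch position per poem over repeated random partitions.

    batch_size, rounds, and mean-position aggregation are assumptions.
    """
    rng = random.Random(seed)
    positions = {p: [] for p in poem_ids}
    for _ in range(rounds):
        shuffled = list(poem_ids)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Forced choice: rank_batch returns the batch ordered best-first.
            for position, poem in enumerate(rank_batch(batch), start=1):
                positions[poem].append(position)
    return {p: mean(v) for p, v in positions.items()}


def toy_judge(batch, tier_of, rng, noise=0.5):
    """Hypothetical stand-in for an LLM call: orders by true tier plus noise."""
    return sorted(batch, key=lambda p: tier_of[p] + rng.gauss(0, noise))


# Toy data: 90 poems in three made-up tiers (0 = strongest), for illustration.
poems = list(range(90))
tier_of = {p: p // 30 for p in poems}
judge_rng = random.Random(1)
scores = batched_scores(poems, lambda b: toy_judge(b, tier_of, judge_rng))
rho = spearman([scores[p] for p in poems], [tier_of[p] for p in poems])
print(f"Spearman correlation with toy ground truth: {rho:.2f}")
```

A lower mean position means a poem was ranked nearer the top of its batches, so a positive correlation with the tier index indicates agreement with the toy ground truth. Replacing `toy_judge` with a real model call would require a prompt and a response parser, neither of which is specified here.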

Taxonomy

Topics: Artificial Intelligence in Games · Aesthetic Perception and Analysis · Computational and Text Analysis Methods