Can GPT replace human raters? Validity and reliability of machine-generated norms for metaphors

Veronica Mangiaterra; Hamad Al-Azary; Chiara Barattieri di San Pietro; Paolo Canal; Valentina Bambini

arXiv:2512.12444 · cs.CL · December 16, 2025

Open Access

TL;DR

This study evaluates the validity and reliability of GPT-generated ratings for metaphors in terms of familiarity, comprehensibility, and imageability, comparing them to human ratings and behavioral responses, across English and Italian datasets.

Contribution

It provides the first comprehensive validation of GPT models for rating complex metaphors, demonstrating their potential to replace or augment human raters in psycholinguistic research.

Findings

01

GPT ratings correlate with human ratings, especially in familiarity and imageability.

02

Larger GPT models outperform smaller ones in rating accuracy.

03

GPT ratings predict behavioral and electrophysiological responses effectively.
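The alignment check behind finding 01 can be illustrated with a rank correlation between human and machine ratings of the same items. The snippet below is a minimal sketch, not the authors' analysis: the choice of Spearman's rho, the 1-7 rating scale, the variable names, and the toy numbers are all assumptions made for illustration, and none of the values come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: mean human ratings and model ratings for six
# metaphors on an assumed 1-7 familiarity scale (not values from the paper).
human_familiarity = [6.1, 2.3, 4.8, 5.5, 1.9, 3.7]
gpt_familiarity = [6.0, 3.0, 5.0, 5.0, 2.0, 4.0]

rho = spearman(human_familiarity, gpt_familiarity)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based statistic is used here because rating scales are ordinal and a model may compress or shift the scale while still ordering items as humans do; a different statistic could be substituted without changing the overall shape of the check.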

Abstract

As Large Language Models (LLMs) are increasingly being used in scientific research, the issue of their trustworthiness becomes crucial. In psycholinguistics, LLMs have been recently employed in automatically augmenting human-rated datasets, with promising results obtained by generating ratings for single words. Yet, performance for ratings of complex items, i.e., metaphors, is still unexplored. Here, we present the first assessment of the validity and reliability of ratings of metaphors on familiarity, comprehensibility, and imageability, generated by three GPT models for a total of 687 items gathered from the Italian Figurative Archive and three English studies. We performed a thorough validation in terms of both alignment with human data and ability to predict behavioral and electrophysiological responses. We found that machine-generated ratings positively correlated with…

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Action Observation and Synchronization · Language, Metaphor, and Cognition