Do Large Language Models Judge Error Severity Like Humans?

Diege Sun; Guanyi Chen; Zhao Fan; Xiaorong Cheng; Tingting He

arXiv:2506.05142·cs.CL·June 10, 2025

Open Access

TL;DR

This study compares human and LLM assessments of error severity in image descriptions. Most LLMs do not accurately replicate human judgments, although some models align surprisingly well, especially in unimodal settings.

Contribution

The paper systematically evaluates and compares human and LLM error severity judgments across error types and modalities, highlighting the strengths and limitations of current models.

Findings

01

Most LLMs assign low severity to gender errors but high severity to colour errors, whereas humans judge both as highly severe.

02

DeepSeek-V3, a unimodal LLM, aligns closely with human judgments across conditions.

03

Only Doubao replicates the human-like ranking of error severity, but it does not clearly distinguish between error types.

Abstract

Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for…
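
One way to quantify whether a model ranks the four error types the way humans do is a rank correlation over mean severity per error type. The sketch below is an illustration only, not the authors' analysis: the choice of Spearman correlation, the helper names (`average_ranks`, `spearman`, `mean_severity`), and every rating in it are assumptions, and the numbers are invented placeholders rather than data from the paper.

```python
from statistics import mean

# The four error types named in the abstract.
ERROR_TYPES = ["age", "gender", "clothing type", "clothing colour"]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def mean_severity(ratings):
    """Map {error_type: [scores]} to a list of means in ERROR_TYPES order."""
    return [mean(ratings[t]) for t in ERROR_TYPES]


# Hypothetical placeholder ratings on an arbitrary scale (NOT from the paper).
human = {"age": [3, 4], "gender": [6, 7],
         "clothing type": [4, 5], "clothing colour": [6, 6]}
model = {"age": [3, 3], "gender": [2, 3],
         "clothing type": [4, 4], "clothing colour": [7, 6]}

rho = spearman(mean_severity(human), mean_severity(model))
print(f"Spearman rho between human and model severity rankings: {rho:.2f}")
```

A value near 1 would indicate that the model orders the error types as humans do, while a value near 0 or below would indicate a different ordering. With only four error types the statistic is coarse, so it is better read as a descriptive summary than as a significance test.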

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language and cultural evolution