Do Large Language Models Judge Error Severity Like Humans?
Diege Sun, Guanyi Chen, Zhao Fan, Xiaorong Cheng, Tingting He

TL;DR
This study compares human and LLM assessments of error severity in image descriptions. Most LLMs do not accurately replicate human judgments, although some models show surprising alignment, especially in unimodal settings.
Contribution
The paper systematically compares human and LLM error severity judgments across error types and modalities, highlighting the strengths and limitations of current models.
Findings
Unlike humans, most LLMs assign low severity to gender errors but high severity to colour errors.
DeepSeek-V3, a unimodal LLM, aligns closely with human judgments across conditions.
Only Doubao replicates the human-like error severity ranking, but it does not clearly distinguish between error types.
Abstract
Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for…
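One way to make the comparison of severity rankings concrete is to average the ratings per error type for humans and for a model, then measure how well the two orderings agree. The sketch below uses Spearman rank correlation for this; that metric, the rating scale, the helper names, and all rating values are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean

# The four error types named in the abstract.
ERROR_TYPES = ["age", "gender", "clothing type", "clothing colour"]


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_severity(ratings):
    """Mean rating per error type, in the fixed ERROR_TYPES order."""
    return [mean(ratings[t]) for t in ERROR_TYPES]


# Placeholder ratings on a hypothetical 1-5 scale. These are NOT data from the paper.
human = {"age": [2, 3, 2], "gender": [5, 4, 5],
         "clothing type": [3, 4, 3], "clothing colour": [4, 4, 5]}
model = {"age": [2, 2, 3], "gender": [1, 2, 1],
         "clothing type": [3, 3, 4], "clothing colour": [5, 5, 4]}

rho = spearman(mean_severity(human), mean_severity(model))
print(f"Spearman rho between human and model rankings: {rho:.2f}")
```

A value near 1 would indicate that the model orders the error types as humans do, while a value near 0 or below would indicate a different ordering. With only four error types the statistic is coarse, so it is better read as a descriptive summary than as a significance test.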
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language and cultural evolution
