LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps

Keito Inoshita; Xiaokang Zhou; Akira Kawai; Katsutoshi Yada

arXiv:2604.27345 · cs.CL · May 4, 2026

TL;DR

This paper asks whether Large Language Models (LLMs) reproduce the disagreement among human annotators in emotion labeling. It finds that LLMs handle emotions with explicit lexical cues well but struggle with complex, context-dependent ones, and it proposes calibration methods to narrow the gap.

Contribution

The paper demonstrates that LLMs primarily capture emotions that have explicit lexical markers, and it introduces calibration techniques that bring LLM judgments closer to the distribution of human disagreement.

Findings

01. Zero-shot LLMs diverge substantially from human emotion judgment distributions

02. In-domain fine-tuning reduces the gap more than increasing model size

03. Calibration methods can decrease the distributional gap by up to 14%

Abstract

Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but…
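
To make the idea of comparing judgment distributions concrete, the sketch below turns two sets of categorical emotion labels for one text into probability distributions and measures how far apart they are. The excerpt above does not say which divergence the paper uses, so Jensen-Shannon divergence is an assumption chosen here for illustration. The function names, the three-category label set, and the example labels are likewise hypothetical and not taken from the paper or its benchmarks.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Convert a list of categorical labels into a probability vector
    ordered by `categories` (hypothetical helper, not from the paper)."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels from the given categories")
    return [counts[c] / total for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1.
    Assumed here as the gap measure; the paper may use another one."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative (made-up) labels for a single text.
categories = ["joy", "anger", "neutral"]
human_labels = ["joy", "joy", "neutral", "anger"]  # annotators disagree
llm_labels = ["joy", "joy", "joy", "joy"]          # repeated samples agree

human_dist = label_distribution(human_labels, categories)
llm_dist = label_distribution(llm_labels, categories)
gap = js_divergence(human_dist, llm_dist)
print(f"human={human_dist} llm={llm_dist} JSD={gap:.3f}")
```

In this toy case the majority label matches ("joy" in both), yet the divergence is clearly above zero because the model's responses carry none of the annotators' disagreement. That is the kind of gap a single gold-standard evaluation would hide.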
