LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps
Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada

TL;DR
This paper investigates whether Large Language Models (LLMs) can replicate human disagreement in emotion labeling. It finds that LLMs capture emotions signaled by explicit lexical cues but struggle with complex, context-dependent emotions, and it proposes calibration methods to improve their alignment with human judgment distributions.
Contribution
It demonstrates that LLMs primarily capture emotions that carry explicit lexical markers, and it introduces calibration techniques that better align LLM judgments with the distribution of human disagreement.
Findings
Zero-shot LLMs diverge substantially from human emotion judgment distributions
In-domain fine-tuning reduces the gap more than increasing model size does
Calibration methods can decrease the distributional gap by up to 14%
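This summary does not say which calibration methods the paper uses. As a hedged illustration only, the sketch below shows temperature scaling, one common post-hoc calibration technique, applied to an LLM's label distribution so that it moves closer to the human distribution. The function names, the grid search, and the use of total variation distance as the gap measure are all assumptions made for this example, not the authors' method.

```python
def temper(q, temperature):
    """Rescale a probability distribution: T > 1 flattens it, T < 1 sharpens it."""
    powered = [x ** (1.0 / temperature) for x in q]
    z = sum(powered)
    return [x / z for x in powered]


def total_variation(p, q):
    """Total variation distance between two distributions over the same labels."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def fit_temperature(human_dists, llm_dists, grid):
    """Pick the temperature in `grid` that minimizes the mean human-LLM gap.

    Assumption: each item has one human and one LLM distribution, aligned
    over the same ordered label set.
    """
    def mean_gap(temperature):
        gaps = [
            total_variation(h, temper(l, temperature))
            for h, l in zip(human_dists, llm_dists)
        ]
        return sum(gaps) / len(gaps)

    return min(grid, key=mean_gap)


# Hypothetical toy data: the LLM is overconfident relative to the annotators.
human = [[0.5, 0.5], [0.6, 0.4]]
llm = [[0.9, 0.1], [0.95, 0.05]]
best_t = fit_temperature(human, llm, grid=[0.5, 1.0, 2.0, 4.0])
calibrated = [temper(q, best_t) for q in llm]
```

In practice the temperature would be fitted on a held-out split and then applied to unseen items; fitting and evaluating on the same items overstates the improvement.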
Abstract
Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but…
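To make the idea of comparing judgment distributions concrete, here is a minimal sketch that turns repeated labels for one text into a distribution and measures the human-LLM gap. The excerpt above does not name the divergence the paper uses, so Jensen-Shannon divergence (base 2, bounded between 0 and 1) is an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter


def to_distribution(labels, categories):
    """Convert a list of labels for one item into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall within the given categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical example: annotators disagree, repeated LLM samples do not.
categories = ["joy", "neutral", "surprise"]
human_labels = ["joy", "joy", "neutral", "surprise"]
llm_labels = ["joy", "joy", "joy", "joy"]

human_dist = to_distribution(human_labels, categories)
llm_dist = to_distribution(llm_labels, categories)
gap = js_divergence(human_dist, llm_dist)
```

Here the LLM's majority label matches the human majority label, so an accuracy-style evaluation would score it as correct, yet `gap` is greater than zero because the LLM assigns no mass to the labels a minority of annotators chose.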
