Emergence of Hierarchical Emotion Organization in Large Language Models
Bo Zhao, Maya Okawa, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

TL;DR
This paper investigates how large language models naturally develop hierarchical emotional structures similar to psychological models, revealing their emergent emotional reasoning and biases across social groups.
Contribution
It demonstrates that LLMs form hierarchical emotion trees aligned with human psychology and uncovers biases in emotion recognition related to socioeconomic and intersectional identities.
Findings
LLMs form hierarchical emotion trees similar to psychological models.
Larger models develop more complex emotional hierarchies.
Systematic biases in emotion recognition are present across social groups.
Abstract
As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels -- a psychological framework that argues emotions organize hierarchically -- we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded…
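The abstract does not spell out how the "probabilistic dependencies between emotional states" become a tree, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes each scenario has already been reduced to a probability distribution over emotion words (for example, from next-token probabilities), and it attaches an emotion to a broader one when enough of its soft co-occurrence mass falls on that emotion. The function name `build_emotion_tree`, the `threshold` value, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Optional


def build_emotion_tree(
    probs: List[Dict[str, float]], threshold: float = 0.3
) -> Dict[str, Optional[str]]:
    """Map each emotion to a parent emotion, or None if it stays a root.

    probs: one dict per scenario, mapping emotion word -> probability.
    threshold: hypothetical cutoff on the share of co-occurrence mass.
    """
    emotions = sorted(probs[0])
    # Total probability mass per emotion; a parent must be "broader",
    # which this sketch approximates as having more total mass.
    mass = {e: sum(p[e] for p in probs) for e in emotions}
    # Soft co-occurrence: co[a][b] = sum over scenarios of p(a) * p(b).
    co = {
        a: {b: sum(p[a] * p[b] for p in probs) for b in emotions}
        for a in emotions
    }
    parents: Dict[str, Optional[str]] = {}
    for a in emotions:
        row_total = sum(co[a].values())
        best, best_score = None, threshold
        for b in emotions:
            if b == a or mass[b] <= mass[a]:
                continue
            # Share of a's co-occurrence mass that lands on b.
            score = co[a][b] / row_total
            if score > best_score:
                best, best_score = b, score
        parents[a] = best
    return parents


# Toy distributions (invented for illustration, not data from the paper).
toy = [
    {"joy": 0.6, "optimism": 0.3, "sadness": 0.1},
    {"joy": 0.5, "optimism": 0.4, "sadness": 0.1},
    {"joy": 0.1, "optimism": 0.05, "sadness": 0.85},
    {"joy": 0.7, "optimism": 0.1, "sadness": 0.2},
]
tree = build_emotion_tree(toy)
```

On this toy input, "optimism" is attached under "joy", while "joy" and "sadness" remain roots. The mass-ordering rule and the threshold are design choices of the sketch; a different dependency measure would yield a different tree.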
Peer Reviews
Decision: Submitted to ICLR 2026
1. Novel, interpretable methodology grounded in psychology. The paper proposes a clear tree-construction procedure from next-token distributions and aligns the resulting hierarchies with established emotion frameworks (e.g., emotion wheels). This bridges LLM evaluation with cognitive science in a way that is both conceptually sound and easy to visualize.
2. Convincing multi-scale analysis showing emergence with model size. Evaluating models from small to very large parameters demonstrates a cons
1. Dependence on LLM-generated data: Core experiments use GPT-4-generated scenarios, risking transfer of stylistic and demographic biases. Stronger validation on human-annotated datasets (e.g., GoEmotions) is needed.
2. Unvalidated modeling assumption: The use of next-token probabilities as proxies for P(emotion∣scenario) is not empirically tested. Comparisons with probing or clustering methods could substantiate it.
3. Evaluation mismatch: Humans used a 6-way classification task while models op
The method from Section is a good and interesting way to extract emotional hierarchies from LLMs.
The paper presents its results with a lack of clarity and care for the reader. The methodological contribution is never properly evaluated against a ground truth. The text of the paper contradicts the results presented. The paper needs to be rewritten to become an interesting and meaningful contribution. My suggestion is to focus on the method from Section 3 and flesh it out. Furthermore, if the authors want to claim model bias, they should (a) make sure that the stated bias align
- Novelty: The paper creatively uses the emotion wheel, a psychological model, to evaluate hierarchical emotion recognition in LLMs.
- Relevance: As LLMs increasingly engage in human-like dialogue, studying their emotional recognition and biases is timely and practically valuable.
- Clarity: The presentation is clear and easy to follow. The logical flow helps readers understand both the method and the experiment design.
- Comprehensive analysis: The authors test multiple settings, including dif
- Some related works, such as [1], are neglected.
- The analysis of demographic bias (e.g., Figure 7) only covers two emotions ("anger" and "fear"). It would be better to include overall accuracy or more emotion categories to show whether the overall bias is consistent.
- The evaluation data are primarily generated by GPT-4, raising questions about whether the performance and conclusions remain valid for real data produced by humans.
- The hierarchical emotion relations are derived solely fro
Taxonomy
Topics: Topic Modeling
