Emergence of Hierarchical Emotion Organization in Large Language Models

Bo Zhao; Maya Okawa; Eric J. Bigelow; Rose Yu; Tomer Ullman; Ekdeep Singh Lubana; Hidenori Tanaka

arXiv:2507.10599·cs.CL·July 16, 2025

Open Access 3 Reviews

TL;DR

This paper investigates how large language models naturally develop hierarchical emotional structures similar to psychological models, revealing their emergent emotional reasoning and biases across social groups.

Contribution

It demonstrates that LLMs form hierarchical emotion trees aligned with human psychology and uncovers biases in emotion recognition related to socioeconomic and intersectional identities.

Findings

01. LLMs form hierarchical emotion trees similar to psychological models.

02. Larger models develop more complex emotional hierarchies.

03. Systematic biases in emotion recognition are present across social groups.

Abstract

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels -- a psychological framework that argues emotions organize hierarchically -- we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded…
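As a rough illustration of what "probabilistic dependencies between emotional states" could look like in practice, the sketch below builds a small parent/child tree from per-scenario emotion probabilities. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names (`emotion_probs`, `build_tree`), the 0.3 threshold, the rule that a parent must carry more total probability mass than its child, and the toy numbers are all hypothetical choices made for this example.

```python
import math

def emotion_probs(logits):
    """Softmax over {emotion: logit}, restricted to the emotion words only.

    Assumption: the logits come from a model's next-token distribution,
    read off at the tokens that spell each emotion word.
    """
    m = max(logits.values())
    exps = {e: math.exp(v - m) for e, v in logits.items()}
    z = sum(exps.values())
    return {e: x / z for e, x in exps.items()}

def build_tree(prob_rows, threshold=0.3):
    """Return {child: parent} from a list of {emotion: probability} rows.

    Hypothetical rule: b becomes the parent of a when
      (1) b has more total mass than a (b is the broader emotion), and
      (2) the a-weighted average of p(b), an estimate of P(b | a),
          is the largest among candidates and exceeds `threshold`.
    """
    emotions = sorted(prob_rows[0])
    mass = {e: sum(r[e] for r in prob_rows) for e in emotions}
    parent = {}
    for a in emotions:
        best, best_score = None, threshold
        for b in emotions:
            if b == a or mass[b] <= mass[a]:
                continue  # a parent must be broader than its child
            # Estimate of P(b | a): co-activation of a and b, normalized by a's mass.
            score = sum(r[a] * r[b] for r in prob_rows) / mass[a]
            if score > best_score:
                best, best_score = b, score
        if best is not None:
            parent[a] = best
    return parent

# Toy data (made up): one probability row per scenario.
toy_rows = [
    {"joy": 0.6, "optimism": 0.3, "anger": 0.1},
    {"joy": 0.7, "optimism": 0.2, "anger": 0.1},
    {"joy": 0.1, "optimism": 0.05, "anger": 0.85},
]
tree = build_tree(toy_rows)
print(tree)  # {'optimism': 'joy'}
```

In this toy setup "optimism" attaches under "joy" because the two are active in the same scenarios and "joy" carries more mass overall, while "anger" stays a root. A real analysis would need many scenarios and a much larger emotion vocabulary, and the threshold would have to be chosen and validated rather than fixed by hand.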

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Novel, interpretable methodology grounded in psychology. The paper proposes a clear tree-construction procedure from next-token distributions and aligns the resulting hierarchies with established emotion frameworks (e.g., emotion wheels). This bridges LLM evaluation with cognitive science in a way that is both conceptually sound and easy to visualize.
2. Convincing multi-scale analysis showing emergence with model size. Evaluating models from small to very large parameters demonstrates a cons

Weaknesses

1. Dependence on LLM-generated data: Core experiments use GPT-4-generated scenarios, risking transfer of stylistic and demographic biases. Stronger validation on human-annotated datasets (e.g., GoEmotions) is needed.
2. Unvalidated modeling assumption: The use of next-token probabilities as proxies for P(emotion∣scenario) is not empirically tested. Comparisons with probing or clustering methods could substantiate it.
3. Evaluation mismatch: Humans used a 6-way classification task while models op

Reviewer 02 · Rating 0 · Confidence 4

Strengths

The method from Section is a good method and an interesting way to extract emotional hierarchies from LLMs.

Weaknesses

The paper presents its results with a lack of clarity and care for the reader. The methodological contribution is never properly evaluated against a ground truth. The text of the paper contradicts the results presented. The paper needs to be rewritten before it can be an interesting and meaningful contribution. My suggestion is to focus on the method from Section 3 and flesh it out. Furthermore, if the authors want to claim model bias, they should (a) make sure that the stated bias align

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Novelty: The paper creatively uses the emotion wheel, a psychological model, to evaluate hierarchical emotion recognition in LLMs.
- Relevance: As LLMs increasingly engage in human-like dialogue, studying their emotional recognition and biases is timely and practically valuable.
- Clarity: The presentation is clear and easy to follow. The logical flow helps readers understand both the method and the experiment design.
- Comprehensive analysis: The authors test multiple settings, including dif

Weaknesses

- Some related works, such as [1], are neglected.
- The analysis of demographic bias (e.g., Figure 7) only covers two emotions (“anger” and “fear”). It would be better to include overall accuracy or more emotion categories to show whether the overall bias is consistent.
- The evaluation data are primarily generated by GPT-4, raising questions about whether the performance and conclusions remain valid for real data produced by humans.
- The hierarchical emotion relations are derived solely fro

Taxonomy

Topics: Topic Modeling