# Granularity paradox: how emotion taxonomies shape GPT-5’s affective cognition and human-AI alignment

**Authors:** Fa Zhang, Jian Chen

PMC · DOI: 10.3389/fpsyg.2026.1786724 · Frontiers in Psychology · 2026-03-13

## TL;DR

This study shows that finer-grained emotion taxonomies degrade GPT-5's annotation quality: the model mislabels emotions more often and agrees less with human annotators.

## Contribution

The study introduces the 'granularity paradox': as emotion categories become more detailed, GPT-5's accuracy and its agreement with human annotators both decline.

## Key findings

- GPT-5's performance is strongly negatively correlated with taxonomic complexity and collapses under the fine-grained GoEmotions taxonomy (27 classes).
- The model over-labels neutral texts as emotional, with false-positive rates rising as the taxonomy grows, and it misclassifies low-arousal negative emotions such as sadness as high-arousal ones such as fear or anger.
- The indigenous Chinese SevenEmotions taxonomy did not yield better cultural alignment than the Western taxonomies.

## Abstract

Large Language Models (LLMs) have demonstrated exceptional capability in textual emotion detection. However, LLM evaluations often treat the emotion taxonomy—the “cognitive ruler” defining the emotional space—as a neutral background variable. The extent to which taxonomic complexity moderates LLM performance remains underexplored.

This study systematically evaluates the impact of emotion taxonomy on GPT-5’s annotation behavior. We constructed a dataset of 2,848 Chinese Weibo posts. Five human annotators and GPT-5 (zero-shot) labeled the data under five taxonomies of increasing granularity: SemEval (4 classes), Ekman (6 classes), Chinese SevenEmotions (7 classes), Plutchik (8 classes), and GoEmotions (27 classes). The experimental design used randomized ordering and washout periods to minimize sequence effects. We compare GPT-5’s labels with the human annotations along three dimensions: performance, consistency, and bias patterns.
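As a rough illustration of what randomized taxonomy ordering could look like, the sketch below gives each annotator an independently shuffled sequence of the five taxonomies. The taxonomy names and class counts come from the abstract; the function `assign_orders`, the annotator IDs, and the per-annotator shuffling scheme are hypothetical and are not taken from the paper, which does not describe its procedure in this summary.

```python
import random

# Taxonomy names and class counts as listed in the abstract.
TAXONOMIES = {
    "SemEval": 4,
    "Ekman": 6,
    "SevenEmotions": 7,
    "Plutchik": 8,
    "GoEmotions": 27,
}


def assign_orders(annotator_ids, seed=0):
    """Return a shuffled taxonomy order per annotator (hypothetical scheme)."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    orders = {}
    for annotator in annotator_ids:
        order = list(TAXONOMIES)  # copy of the taxonomy names
        rng.shuffle(order)        # in-place shuffle for this annotator
        orders[annotator] = order
    return orders


# Hypothetical annotator IDs; the study reports five human annotators.
orders = assign_orders(["A1", "A2", "A3", "A4", "A5"])
for annotator, order in orders.items():
    print(annotator, order)
```

A washout period would then be enforced by scheduling a gap between consecutive taxonomies in each annotator's sequence; its length is not given in this summary.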

Results reveal a significant “granularity paradox”: GPT-5’s performance is strongly negatively correlated with taxonomic complexity, with performance collapsing in fine-grained settings (GoEmotions). Crucially, we identified systematic misalignment mechanisms: (1) Consistency decay: Human-AI agreement significantly deteriorates as semantic boundaries blur in complex taxonomies; (2) Hyper-sensitivity bias: GPT-5 exhibits a tendency to over-interpret neutral texts as emotional, with false-positive rates increasing with taxonomy size; and (3) Arousal shift: The model consistently misclassifies low-arousal negative emotions (e.g., sadness) as high-arousal prototypes (e.g., fear/anger), reflecting a valence-based rather than nuance-based inference logic. Notably, the indigenous SevenEmotions did not yield superior cultural alignment compared to Western taxonomies.
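The three misalignment patterns can each be expressed as a simple statistic over paired human and model labels. The following is a minimal sketch under stated assumptions: this summary does not say which agreement statistic the paper uses, so Cohen's kappa is an assumed stand-in; the label strings (`"neutral"`, `"sadness"`, `"anger"`, `"fear"`), the function names, and the six-item toy data are illustrative only and are not results from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def neutral_false_positive_rate(human, model, neutral="neutral"):
    """Share of human-neutral items that the model labels as some emotion."""
    idx = [i for i, h in enumerate(human) if h == neutral]
    if not idx:
        return 0.0
    return sum(model[i] != neutral for i in idx) / len(idx)


def arousal_shift_rate(human, model, low=("sadness",), high=("anger", "fear")):
    """Share of low-arousal items that the model maps to a high-arousal label."""
    idx = [i for i, h in enumerate(human) if h in low]
    if not idx:
        return 0.0
    return sum(model[i] in high for i in idx) / len(idx)


# Toy labels for illustration only (not data from the paper).
human = ["neutral", "sadness", "neutral", "anger", "sadness", "neutral"]
model = ["anger", "anger", "neutral", "anger", "sadness", "sadness"]

kappa = cohen_kappa(human, model)
fpr = neutral_false_positive_rate(human, model)
shift = arousal_shift_rate(human, model)
print(f"kappa={kappa:.3f} neutral_fpr={fpr:.3f} arousal_shift={shift:.3f}")
```

Computing these statistics once per taxonomy and comparing them against the class count would be one way to examine the trends described above.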

Our findings suggest that emotion taxonomies function as a critical hyperparameter that shapes the cognitive boundaries of GPT. While GPT shows promise, its reliability is compromised by complex taxonomies. Researchers must balance granular detail against model robustness when deploying LLMs for psychological analysis.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13021436/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13021436/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC13021436/full.md

---
Source: https://tomesphere.com/paper/PMC13021436