Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs

Changhao Song; Yazhou Zhang; Hui Gao; Kaiyun Huang; Peng Zhang

arXiv:2505.22548·cs.CL·August 7, 2025

Open Access · 4 Reviews

TL;DR

Emotion-o1 introduces an adaptive reasoning framework for LLMs that dynamically balances reasoning depth and efficiency, significantly improving emotion understanding across various tasks with reduced reasoning length.

Contribution

The paper presents Emotion-o1, a novel adaptive CoT framework that adjusts reasoning length based on task complexity, enhancing emotion understanding in LLMs.

Findings

1. Significant F1 score improvements across emotion tasks.
2. Outperforms advanced LLMs like Grok-3 and Claude-3.
3. Reduces reasoning length by 83% while maintaining accuracy.

Abstract

Long chain-of-thought (CoT) reasoning has shown great promise in enhancing the emotion understanding performance of large language models (LLMs). However, current fixed-length CoT methods struggle to balance reasoning depth and efficiency. Simple tasks (e.g., sentiment classification) are over-reasoned, while complex tasks (e.g., sarcasm understanding) lack depth. To fill this gap, we present Emotion-o1, an adaptive CoT framework that dynamically adjusts reasoning length based on emotion-task complexity. Emotion-o1 is trained by distilling adaptive CoT patterns from a reasoning-oriented LLM, followed by supervised fine-tuning and reinforcement learning with a four-part reward targeting accuracy, brevity, structure, and redundancy. Experimental results on four emotion tasks highlight: (1) Emotion-o1 demonstrates significant improvements over its backbone, with F1 score increases of…
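As a rough illustration of the four-part reward named in the abstract, the sketch below combines accuracy, brevity, structure, and redundancy terms into a single scalar. The paper's actual formulas are not given in this summary, so every term definition here is an assumption: the `<think>`/`<answer>` output format, the linear length penalty, the n-gram redundancy measure, the `target_len` parameter, and the weights are all hypothetical.

```python
import re


def accuracy_reward(pred: str, gold: str) -> float:
    # 1.0 for an exact (case-insensitive) label match, else 0.0.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def brevity_reward(n_tokens: int, target_len: int) -> float:
    # Assumed shape: full reward at or under the target length,
    # decaying linearly to 0 at twice the target.
    if n_tokens <= target_len:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - target_len) / target_len)


def structure_reward(output: str) -> float:
    # Assumed output format: <think>...</think><answer>...</answer>.
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, output, flags=re.DOTALL) else 0.0


def redundancy_reward(reasoning: str, n: int = 3) -> float:
    # Fraction of distinct word n-grams; repeated phrasing lowers the score.
    tokens = reasoning.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)


def total_reward(output: str, gold: str, target_len: int,
                 weights=(1.0, 0.5, 0.5, 0.5)) -> float:
    # Weighted sum of the four terms; the weights are illustrative only.
    think = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    reasoning = think.group(1) if think else ""
    pred = answer.group(1) if answer else ""
    terms = (
        accuracy_reward(pred, gold),
        brevity_reward(len(reasoning.split()), target_len),
        structure_reward(output),
        redundancy_reward(reasoning),
    )
    return sum(w * t for w, t in zip(weights, terms))
```

Under these assumptions, making the framework adaptive would amount to choosing `target_len` per task, for example a small value for sentiment classification and a larger one for sarcasm understanding.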

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is motivated by the observation that emotion understanding tasks vary in difficulty and in the reasoning depth they need, which is a fair argument.
- Emotion-o1 demonstrates strong results on the target tasks and shows that the training recipe is effective on a small (8B) model.
- Emotion-o1 has significantly higher token efficiency across the four tasks while maintaining competitive performance compared to SoTA reasoning models like DeepSeek-R1 and OpenAI-o1.

Weaknesses

- Emotion-o1 undergoes task-specific training, whereas the baselines appear to be evaluated zero-shot, which favors the proposed model. For a fair comparison, the authors should also evaluate fine-tuned (SFT and RL) baselines to isolate the effect of adaptive reasoning from that of task-specific training.
- Generalization of Emotion-o1 is unexplored. Improvements with in-domain training are expected and add little new insight for the research community. The authors should evaluate the model on unseen dataset

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The framework can dynamically adjust reasoning depth according to task complexity in emotion understanding, moving beyond the fixed-length CoT paradigm.
2. A carefully constructed reward function jointly optimizes accuracy, brevity, structural coherence, and redundancy control.

Weaknesses

1. The method retains only label-consistent CoTs from the teacher model, effectively conditioning training on the gold label. This selection can bias the student toward reproducing “correct answer patterns” rather than learning generalizable reasoning behavior. The paper does not analyze how much of the downstream improvement depends on this filtering or whether it leads to shortcut learning.
2. It remains unclear why emotion tasks require explicit CoT reasoning. The paper does not convincingly
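The filtering step that weakness 1 refers to can be pictured with a minimal sketch: teacher outputs are split by whether the teacher's predicted label matches the gold label, and only the matching ones are kept for training. The `TeacherSample` record, its field names, and the helper functions are hypothetical and are not taken from the paper. Returning the dropped samples alongside the kept ones is a design choice that would make it possible to measure how much the filter removes, which is the analysis the reviewer asks for.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TeacherSample:
    # Hypothetical record of one teacher-generated reasoning trace.
    text: str
    gold_label: str
    cot: str
    predicted_label: str


def normalize(label: str) -> str:
    # Compare labels case-insensitively and ignore surrounding whitespace.
    return label.strip().lower()


def split_by_label_consistency(
    samples: List[TeacherSample],
) -> Tuple[List[TeacherSample], List[TeacherSample]]:
    # Keep traces whose final label agrees with the gold label; set the rest aside.
    kept, dropped = [], []
    for sample in samples:
        if normalize(sample.predicted_label) == normalize(sample.gold_label):
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


def retention_rate(samples: List[TeacherSample]) -> float:
    # Share of teacher traces that survive the filter (0.0 for an empty input).
    if not samples:
        return 0.0
    kept, _ = split_by_label_consistency(samples)
    return len(kept) / len(samples)
```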

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Clear problem framing & method: Adaptive depth for simple vs. complex emotion tasks; the training pipeline and reward components are specified with equations and hyperparameters/intuition.
2. Empirical signal: On four benchmarks, Emotion-o1 improves over its backbone (up to +27% Weighted-F1 on sarcasm), and short/long ablations align with task complexity (short helps sentiment; long helps sarcasm/humor).
3. Data-driven length control. The RL length reward sets task-specific length from quanti

Weaknesses

1. Missing reward ablations. The paper doesn’t show the effect of removing or reweighting each reward term.
2. Missing standard baselines. No head-to-head comparisons with plain SFT or vanilla RL (e.g., PPO) under the same data/compute.
3. No OOD evaluation. Results are only in-domain; there’s no test on out-of-distribution questions.
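For reference, the Weighted-F1 metric cited in this review averages per-class F1 scores with weights proportional to each class's share of the gold labels. The sketch below is a plain-Python version of that standard definition; the function name and the example labels are illustrative and do not come from the paper's evaluation code.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    # Per-class F1, averaged with weights equal to each class's support.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1
    return score
```

Because the weights follow class frequency, a gain on a rare class moves Weighted-F1 less than the same gain on a common class, which matters when comparing scores across tasks with different label distributions.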

Reviewer 04 · Rating 2 · Confidence 5

Strengths

- The description is very thorough, and the approach is very interesting. Using RL for subjective constructs is still an open problem.
- Results are presented across 4 tasks.
- Many baseline models were evaluated.

Weaknesses

In constructing their framework and using their datasets, the authors make several assumptions, some of which I think could lead the affective community down a wrong path:
- First of all, the authors declare emotions to be a "simple" task, which I find quite inappropriate. Even in their results, emotions seem to be on par with sarcasm and humor. The only justification I see for this claim is that the benchmark they use is simplistic (only one emotion per utterance, only 7 basic emotions), whi

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Multimodal Machine Learning Applications