Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs
Changhao Song, Yazhou Zhang, Hui Gao, Kaiyun Huang, Peng Zhang

TL;DR
Emotion-o1 introduces an adaptive reasoning framework for LLMs that dynamically balances reasoning depth and efficiency, significantly improving emotion understanding across various tasks with reduced reasoning length.
Contribution
The paper presents Emotion-o1, a novel adaptive CoT framework that adjusts reasoning length based on task complexity, enhancing emotion understanding in LLMs.
Findings
Significant F1 score improvements across emotion tasks.
Outperforms advanced LLMs like Grok-3 and Claude-3.
Reduces reasoning length by 83% while maintaining accuracy.
Abstract
Long chain-of-thought (CoT) reasoning has shown great promise in enhancing the emotion understanding performance of large language models (LLMs). However, current fixed-length CoT methods struggle to balance reasoning depth and efficiency. Simple tasks (e.g., sentiment classification) are over-reasoned, while complex tasks (e.g., sarcasm understanding) lack depth. To fill this gap, we present Emotion-o1, an adaptive CoT framework that dynamically adjusts reasoning length based on emotion-task complexity. Emotion-o1 is trained by distilling adaptive CoT patterns from a reasoning-oriented LLM, followed by supervised fine-tuning and reinforcement learning with a four-part reward targeting accuracy, brevity, structure, and redundancy. Experimental results on four emotion tasks highlight: (1) Emotion-o1 demonstrates significant improvements over its backbone, with F1 score increases of…
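The four-part reward named in the abstract (accuracy, brevity, structure, redundancy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `<think>`/`<answer>` output format, the weights, the linear over-length penalty, the per-task `target_len`, and the distinct-trigram redundancy measure are all hypothetical stand-ins for the paper's own definitions.

```python
import re
from dataclasses import dataclass, asdict
from typing import Sequence

# Hypothetical output format (an assumption, not the paper's):
#   <think>reasoning</think><answer>label</answer>
PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


@dataclass(frozen=True)
class RewardWeights:
    # Illustrative weights only; the paper's values are not given in this summary.
    accuracy: float = 1.0
    brevity: float = 0.5
    structure: float = 0.25
    redundancy: float = 0.25


def brevity_score(n_tokens: int, target_len: int) -> float:
    """1.0 at or under the task's target length, falling linearly to 0.0 at twice it."""
    if n_tokens <= target_len:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - target_len) / target_len)


def redundancy_score(tokens: Sequence[str], n: int = 3) -> float:
    """Share of distinct n-grams; repeated phrasing lowers the score."""
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def emotion_reward(output: str, gold_label: str, target_len: int,
                   weights: RewardWeights = RewardWeights()) -> float:
    """Weighted average of the four components, normalized to [0, 1]."""
    match = PATTERN.match(output)
    if match is None:
        # Malformed output: in this sketch it earns no reward at all.
        return 0.0
    reasoning, answer = match.group(1), match.group(2)
    tokens = reasoning.split()  # whitespace tokens as a stand-in for model tokens
    parts = {
        "accuracy": float(answer.strip().lower() == gold_label.strip().lower()),
        "brevity": brevity_score(len(tokens), target_len),
        "structure": 1.0,  # the format check above already passed
        "redundancy": redundancy_score(tokens),
    }
    w = asdict(weights)
    return sum(w[name] * value for name, value in parts.items()) / sum(w.values())


if __name__ == "__main__":
    sample = ("<think>Praising rain while stuck outside signals irony.</think>"
              "<answer>sarcastic</answer>")
    print(emotion_reward(sample, "sarcastic", target_len=20))
```

Under these assumptions, a larger `target_len` for sarcasm than for sentiment classification would be one way to let the same reward tolerate longer reasoning on harder tasks.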
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is motivated by the fact that emotion understanding tasks have various difficulties and different reasoning depth needs, which is a fair argument.
- Emotion-o1 demonstrates strong results on the target tasks and effectiveness of the training recipe on a small model (8B).
- Emotion-o1 has significantly higher token efficiency across the four tasks while maintaining competitive performances compared to SoTA reasoning models like DeepSeek-R1 and OpenAI-o1.

- Emotion-o1 undergoes task-specific training, whereas the baselines appear to be evaluated zero-shot, which favors the proposed model. For a fair comparison, the authors should evaluate the additional finetuned (SFT and RL) baselines to isolate adaptive reasoning from task-specific training.
- Generalization of Emotion-o1 is unexplored. Improvements with in-domain training are expected and add little new insight for the research community. The authors should evaluate the model on unseen dataset

1. The framework can dynamically adjust reasoning depth according to task complexity in emotion understanding, moving beyond the fixed-length CoT paradigm.
2. A carefully constructed reward function jointly optimizes accuracy, brevity, structural coherence, and redundancy control.

1. The method retains only label-consistent CoTs from the teacher model, effectively conditioning training on the gold label. This selection can bias the student toward reproducing “correct answer patterns” rather than learning generalizable reasoning behavior. The paper does not analyze how much of the downstream improvement depends on this filtering or whether it leads to shortcut learning.
2. It remains unclear why emotion tasks require explicit CoT reasoning. The paper does not convincingly

1. Clear problem framing & method: Adaptive depth for simple vs. complex emotion tasks; the training pipeline and reward components are specified with equations and hyperparameters/intuition.
2. Empirical signal: On four benchmarks, Emotion-o1 improves over its backbone (up to +27% Weighted-F1 on sarcasm), and short/long ablations align with task complexity (short helps sentiment; long helps sarcasm/humor).
3. Data-driven length control. The RL length reward sets task-specific length from quanti

1. Missing reward ablations. The paper doesn’t show the effect of removing or reweighting each reward term.
2. Missing standard baselines. No head-to-head comparisons with plain SFT or vanilla RL (e.g., PPO) under the same data/compute.
3. No OOD evaluation. Results are only in-domain; there’s no test on out-of-distribution questions.

- The description is very thorough, and the approach very interesting. Using RL for subjective constructs is still an open problem.
- Results presented across 4 tasks.
- Many baseline models were evaluated.

In constructing their framework and using their datasets, the authors make several assumptions, some of which I think could lead the affective community down a wrong path:
- First of all, the authors declare emotions to be a "simple" task, which I find quite inappropriate. Even in their results, emotions seem to be on par with sarcasm and humor. The only justification I see for this claim is that the benchmark they use is simplistic (only one emotion per utterance, only 7 basic emotions), whi
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Multimodal Machine Learning Applications
