EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning
Dingdong Wang, Shujie Liu, Tianhua Zhang, Youjun Chen, Jinyu Li, Helen Meng

TL;DR
EmotionThinker reformulates speech emotion recognition as a deep reasoning task using reinforcement learning, leveraging prosody cues for interpretable predictions and outperforming existing models in accuracy and explanation quality.
Contribution
It introduces a novel reasoning-based framework for speech emotion recognition, incorporating prosody enhancement and a new RL algorithm with trust-aware reasoning rewards.
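The "trust-aware reasoning rewards" named here, together with the progressive reward scheduling and trustworthiness weight that one review on this page mentions, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's algorithm: the function names, the linear ramp, the `max_weight` value, and the additive combination are all hypothetical, since this page does not give the actual formulas.

```python
def reasoning_weight(step: int, total_steps: int, max_weight: float = 0.5) -> float:
    """Hypothetical progressive schedule: ramp the reasoning-reward weight
    linearly from 0 to max_weight over training (assumed, not from the paper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to the range [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * progress


def combined_reward(
    predicted: str,
    gold: str,
    reasoning_score: float,
    trust: float,
    step: int,
    total_steps: int,
) -> float:
    """Hypothetical reward: outcome accuracy plus a reasoning term that is
    scaled by both the schedule and a trust weight in [0, 1]."""
    outcome = 1.0 if predicted == gold else 0.0
    weight = reasoning_weight(step, total_steps)
    # A low trust value suppresses the reasoning reward for this sample.
    return outcome + weight * trust * reasoning_score


# Early in training only the outcome counts; later the reasoning term is added.
early = combined_reward("happy", "happy", 0.8, 1.0, step=0, total_steps=100)
late = combined_reward("happy", "happy", 0.8, 1.0, step=100, total_steps=100)
untrusted = combined_reward("happy", "happy", 0.8, 0.0, step=100, total_steps=100)
```

In this sketch, the schedule lets label accuracy dominate early training, and the trust weight keeps an unreliable reasoning score from inflating the reward. Whether EmotionThinker combines its terms this way is an assumption.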
Findings
EmotionThinker achieves higher emotion recognition accuracy than existing models.
It provides more interpretable explanations grounded in acoustic cues.
Prosody enhancement improves emotion understanding.
Abstract
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody…
Peer Reviews
Decision: ICLR 2026 Oral
The motivation is clear, and the research problem is interesting, as it extends beyond improving emotion classification toward developing deeper reasoning capabilities. The proposed model demonstrates strong performance in both emotion recognition and emotion reasoning, providing valuable insights for advancing SpeechLLMs toward more effective emotion reasoning capabilities.
It is unclear how your model is trained and how it builds upon Qwen2.5-Omni-3B. Please clarify the training process and provide clear explanations for all symbols and notations in your equations, as they are currently difficult to interpret. The methodology section, particularly Section 3.3, lacks clarity. Please provide a clear description of the overall training pipeline and explain the motivation behind each step. The writing in Section 3.3.1 should be further improved for better structure a
**1.** The motivation of the work is clearly stated and explained. **2.** It is the first RL-based emotion recognition framework that provides not only accurate classification but also detailed reasoning rationales and informative captions for the audio. **3.** Each stage of the proposed framework is clearly defined. **4.** The evaluation and ablation are comprehensive.
**1.** For the accuracy of emotion recognition, I would also like to know the performance on each individual discrete emotion. That way, we can have a more concrete and detailed understanding of the framework's capabilities and limitations. **2.** To construct the reasoning responses, is there a specific reason that only GPT 4.0 is used?
1. The reformulation of SER as a deep reasoning task—rather than mere label prediction—is timely and promising for advancing interpretability in multimodal LLMs. 2. The proposed dataset, EmotionCoT-35K, fills a significant gap with CoT-style, prosody-aware emotion reasoning data, with a scalable, largely automated annotation pipeline. This may have value for the broader community. 3. The proposed reinforcement learning scheme employs progressive reward scheduling and a trustworthiness weight to
1. The data construction pipeline heavily relies on LLMs, and the reasoning trace data is constructed with GPT4o without the actual speech input. This may lead to unexpected failure and bias in the dataset. It would also be beneficial to input the speech and conduct a human review of the data quality. 2. The proposed reward model plays a critical role in the RL process. However, there is little discussion or quantitative validation of its calibration. The distributions of GPT-annotated versus hu
Taxonomy
Topics: Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI) · Sentiment Analysis and Opinion Mining
