Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao; Xinfa Zhu; Xinsheng Wang; Shuiyuan Wang; Xuelong Geng; Wenjie Tian; Lei Xie

arXiv:2502.18186·cs.SD·December 30, 2025

TL;DR

This paper introduces C$^2$SER, a novel large-scale audio language model that improves speech emotion recognition stability and accuracy by integrating perceptual modules, chain of thought reasoning, and self-distillation techniques.

Contribution

C$^2$SER is the first to combine contextual perception, chain of thought, and self-distillation for stable and accurate speech emotion recognition in large-scale audio language models.

Findings

01. C$^2$SER outperforms existing models like Qwen2-Audio and SECap.

02. It achieves higher stability and precision in emotion recognition.

03. Extensive experiments validate its effectiveness.

Abstract

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability,…
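The step-by-step idea in the abstract (first consider speech content and speaking style, then decide on an emotion) can be illustrated with a small prompt-building sketch. This is not the authors' implementation: the `Perception` class, the `build_cot_prompt` and `parse_emotion` helpers, the four-label `EMOTIONS` set, and the prompt wording are all hypothetical, chosen only to show how a staged prompt and a constrained answer format might look.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label set; the paper's actual emotion categories may differ.
EMOTIONS = ("angry", "happy", "neutral", "sad")


@dataclass
class Perception:
    """Illustrative container for the two cues the abstract mentions."""
    transcript: str       # speech content (semantic cue)
    speaking_style: str   # description of how it was said (acoustic cue)


def build_cot_prompt(p: Perception) -> str:
    """Lay out content and style as explicit steps before asking for a label."""
    labels = ", ".join(EMOTIONS)
    return (
        f"Step 1. Speech content: {p.transcript}\n"
        f"Step 2. Speaking style: {p.speaking_style}\n"
        f"Step 3. Considering both, choose one emotion from [{labels}].\n"
        "Answer with a final line of the form 'Emotion: <label>'."
    )


def parse_emotion(response: str) -> Optional[str]:
    """Return the label from the last 'Emotion:' line, or None if it is
    missing or outside the label set (an assumed guard against
    irrelevant outputs)."""
    for line in reversed(response.strip().splitlines()):
        if line.lower().startswith("emotion:"):
            label = line.split(":", 1)[1].strip().lower()
            return label if label in EMOTIONS else None
    return None


if __name__ == "__main__":
    cues = Perception("I can't believe we won!", "fast, loud, rising pitch")
    print(build_cot_prompt(cues))
    print(parse_emotion("The speaker sounds excited.\nEmotion: Happy"))
```

Rejecting any answer outside the fixed label set is one simple way to keep a downstream evaluation from counting an off-topic reply as a prediction; it is an assumption of this sketch, not something the abstract describes.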

Code & Models

Repositories

zxzhao0/c2ser
PyTorch · Official

Datasets

ASLP-lab/Emo-Emilia
dataset · 532 dl
