Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

TL;DR
This paper introduces C$^2$SER, a novel large-scale audio language model that improves speech emotion recognition stability and accuracy by integrating perceptual modules, chain of thought reasoning, and self-distillation techniques.
Contribution
C$^2$SER is the first to combine contextual perception, chain of thought, and self-distillation for stable and accurate speech emotion recognition in large-scale audio language models.
Findings
C$^2$SER outperforms existing ALMs such as Qwen2-Audio and SECap on speech emotion recognition.
It delivers more stable and more precise emotion recognition than these baselines.
Extensive experiments validate its effectiveness.
Abstract
Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability,…
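The step-by-step idea in the abstract (speech content first, then speaking style, then the emotion label) can be illustrated with a minimal sketch. This is not the authors' implementation: the label set `EMOTIONS`, the `Perception` container, the prompt wording, and the fallback rule in `parse_label` are all hypothetical choices made for illustration, and no model is called.

```python
import re
from dataclasses import dataclass

# Hypothetical closed label set; the paper's actual categories are not stated here.
EMOTIONS = ("angry", "happy", "neutral", "sad")


@dataclass
class Perception:
    transcript: str  # what was said (stands in for semantic perception)
    style: str       # how it was said (stands in for acoustic perception)


def build_cot_prompt(p: Perception) -> str:
    """Assemble a three-step prompt: content, then style, then the label."""
    return (
        f"Step 1. Speech content: {p.transcript}\n"
        f"Step 2. Speaking style: {p.style}\n"
        f"Step 3. Based on steps 1 and 2, answer with one of: {', '.join(EMOTIONS)}."
    )


def parse_label(response: str, default: str = "neutral") -> str:
    """Map a free-form response onto the closed label set.

    Assumption for this sketch: a response that names zero or several
    labels is treated as unreliable and replaced by `default`.
    """
    words = set(re.findall(r"[a-z]+", response.lower()))
    hits = [e for e in EMOTIONS if e in words]
    return hits[0] if len(hits) == 1 else default


if __name__ == "__main__":
    prompt = build_cot_prompt(Perception("I can't believe it", "loud, fast"))
    print(prompt)
    print(parse_label("The speaker sounds Angry."))
```

Matching on whole words keeps a response such as "unhappy" from being read as "happy", and restricting the answer to a fixed label set is one simple way to discard irrelevant outputs; whether C$^2$SER does anything similar is not stated in the abstract.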
