When Tone and Words Disagree: Towards Robust Speech Emotion Recognition under Acoustic-Semantic Conflict
Dawei Huang, Yongjie Lv, Ruijie Xiong, Chunxiang Jin, Xiaojiang Peng

TL;DR
This paper addresses acoustic-semantic conflict in speech emotion recognition by proposing a framework that disentangles acoustic and semantic information and by introducing a new dataset for evaluation.
Contribution
The paper introduces the Fusion Acoustic-Semantic (FAS) framework and the CASE dataset, improving SER robustness under conflicting acoustic and semantic cues.
Findings
FAS outperforms existing methods in various settings.
State-of-the-art SER models, including ASR-based, self-supervised, and audio language model approaches, degrade under acoustic-semantic conflict.
FAS achieves 59.38% accuracy on the CASE benchmark.
Abstract
Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. However, in real-world interactions, acoustic-semantic conflict is common yet overlooked, where the emotion conveyed by tone contradicts the literal meaning of spoken words. We show that state-of-the-art SER models, including ASR-based, self-supervised learning (SSL) approaches and Audio Language Models (ALMs), suffer performance degradation under such conflicts due to semantic bias or entangled acoustic-semantic representations. To address this, we propose the Fusion Acoustic-Semantic (FAS) framework, which explicitly disentangles acoustic and semantic pathways and bridges them through a lightweight, query-based attention module. To enable systematic evaluation, we introduce the Conflict in Acoustic-Semantic Emotion (CASE), the first dataset dominated by clear and interpretable…
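The abstract describes two separate pathways bridged by a lightweight, query-based attention module. The sketch below is one plausible reading of that idea, not the authors' implementation: a small set of query vectors attends over the acoustic and semantic token features with scaled dot-product attention, and the pooled results are averaged into one fused vector. The names `query_attention` and `fuse`, the feature dimension, the number of queries, the random stand-in features, and the mean pooling are all assumptions made for illustration. In a trained model the queries would be learned parameters and the features would come from real encoders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_attention(queries, tokens):
    """Each query attends over all tokens and returns one pooled vector.

    Hypothetical helper: plain scaled dot-product attention, no projections.
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(d) for t in tokens])
        pooled.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
        )
    return pooled


def fuse(acoustic_tokens, semantic_tokens, queries):
    """Bridge the two pathways: queries read from both token sets.

    The pathways stay separate up to this point; only the queries mix them.
    Mean pooling over queries is an assumption, not taken from the paper.
    """
    tokens = acoustic_tokens + semantic_tokens  # list concatenation
    pooled = query_attention(queries, tokens)
    d = len(pooled[0])
    return [sum(p[i] for p in pooled) / len(pooled) for i in range(d)]


# Stand-in features: random vectors in place of real encoder outputs.
rng = random.Random(0)
DIM = 4  # assumed feature dimension


def random_vectors(n):
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(n)]


acoustic = random_vectors(5)  # e.g. frame-level acoustic features
semantic = random_vectors(3)  # e.g. token-level text features
queries = random_vectors(2)   # would be learned parameters in practice

fused = fuse(acoustic, semantic, queries)
print(fused)
```

Because the attention weights are non-negative and sum to one, each coordinate of the fused vector stays within the range spanned by the input tokens. A classifier head on top of `fused` would produce the emotion label; that head is omitted here.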
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Sentiment Analysis and Opinion Mining
