Temporal-Frequency State Space Duality: An Efficient Paradigm for Speech Emotion Recognition

Jiaqi Zhao; Fei Wang; Kun Li; Yanyan Wei; Shengeng Tang; Shu Zhao; Xiao Sun

arXiv:2412.16904·cs.SD·December 24, 2024

TL;DR

This paper introduces TF-Mamba, a multi-domain framework that effectively captures emotional cues in both temporal and frequency domains for speech emotion recognition, improving efficiency and robustness.

Contribution

The paper presents a temporal-frequency Mamba block and a Complex Metric-Distance Triplet (CMDT) loss, which together extract emotional features from both the temporal and frequency domains while keeping the model efficient.

Findings

01. Outperforms existing methods on the IEMOCAP and MELD datasets.

02. Reduces model size and latency compared to prior approaches.

03. Achieves a better balance between computational efficiency and model expressiveness.

Abstract

Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods are dominated by temporal-domain analysis and overlook the envelope structures of the frequency domain, which are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both temporal and frequency dimensions. Concretely, we propose a temporal-frequency Mamba block to extract temporal- and frequency-aware emotional features, achieving an optimal balance between computational efficiency and model expressiveness. In addition, we design a Complex Metric-Distance Triplet (CMDT) loss to enable the model to capture representative emotional clues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba…
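
For readers unfamiliar with triplet objectives, the following is a minimal sketch of the standard triplet margin loss that losses of this family build on. It is not the paper's CMDT loss, whose exact formulation is not given in this summary; the function names, the Euclidean distance, and the margin value are all illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (an assumption, not the paper's CMDT loss).

    Penalizes the embedding when the anchor is not at least `margin`
    closer to the positive (same emotion) than to the negative
    (different emotion).
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# Hypothetical 2-D embeddings, for illustration only.
anchor = [0.0, 0.0]
same_emotion = [0.0, 1.0]    # distance 1 from the anchor
other_emotion = [3.0, 4.0]   # distance 5 from the anchor

well_separated = triplet_margin_loss(anchor, same_emotion, other_emotion)
poorly_separated = triplet_margin_loss(anchor, other_emotion, same_emotion)
```

In this sketch the loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, so only poorly separated triplets contribute to training.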

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis