Modulation spectral features for speech emotion recognition using deep neural networks
Premjeet Singh, Md Sahidullah, Goutam Saha

TL;DR
This paper introduces constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition, demonstrating improved performance over traditional features by capturing emotion-specific information through a biologically inspired spectral-temporal representation.
Contribution
The work proposes a novel CQT-MSF feature extraction method combined with deep learning, outperforming existing spectrogram and scattering transform features in speech emotion recognition.
Findings
CQT-MSF outperforms mel-scale spectrogram features on SER databases.
CQT-MSF surpasses scattering transform coefficients in emotion recognition.
Grad-CAM analysis confirms the importance of CQT-MSF features in SER.
Abstract
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprises two important cognitive parts: early auditory analysis and cortex-based processing. Early auditory analysis uses a spectrogram-based representation, whereas cortex-based analysis extracts temporal modulations from the spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution in the emotion-salient low-frequency regions of speech, we find that a CQT-based spectrogram, together with its temporal modulations, provides a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift…
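To make the two stages described in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a naive constant-Q filterbank (one Hann-windowed complex kernel per geometrically spaced bin) and assumes the temporal modulations are obtained by taking an FFT along the time axis of each bin's magnitude envelope. The function names, `fmin`, `bins_per_octave`, `n_bins`, `hop`, and the synthetic test signal are all illustrative choices that do not come from the paper.

```python
import numpy as np

def constant_q_spectrogram(signal, sr, fmin=110.0, bins_per_octave=12,
                           n_bins=36, hop=100):
    """Naive constant-Q magnitude spectrogram (illustrative, not optimized).

    Each bin uses a window whose length is inversely proportional to its
    centre frequency, so low frequencies get finer frequency resolution.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # constant Q factor
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    n_frames = 1 + len(signal) // hop
    spec = np.zeros((n_bins, n_frames))
    for k, f in enumerate(freqs):
        length = int(np.ceil(q * sr / f))              # window length for bin k
        t = np.arange(length)
        kernel = np.hanning(length) * np.exp(-2j * np.pi * f * t / sr) / length
        # Zero-pad so that frame m is centred on sample m * hop.
        padded = np.pad(signal, (length // 2, length))
        for m in range(n_frames):
            seg = padded[m * hop:m * hop + length]
            spec[k, m] = np.abs(np.dot(seg, kernel))
    return spec, freqs

def modulation_spectrum(spec, sr, hop):
    """Temporal modulation magnitudes: FFT along time for every CQ bin."""
    frame_rate = sr / hop                              # envelope sampling rate
    centered = spec - spec.mean(axis=1, keepdims=True) # remove per-bin DC
    window = np.hanning(spec.shape[1])
    mod = np.abs(np.fft.rfft(centered * window, axis=1))
    mod_freqs = np.fft.rfftfreq(spec.shape[1], d=1.0 / frame_rate)
    return mod, mod_freqs

# Synthetic check: a 440 Hz tone whose amplitude is modulated at 4 Hz.
sr, hop = 8000, 100
t = np.arange(2 * sr) / sr
signal = (1.0 + 0.8 * np.cos(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 440.0 * t)

spec, freqs = constant_q_spectrogram(signal, sr, hop=hop)
mod, mod_freqs = modulation_spectrum(spec, sr, hop)

carrier_bin = int(np.argmax(spec.mean(axis=1)))        # bin with most energy
peak_hz = mod_freqs[1 + int(np.argmax(mod[carrier_bin, 1:]))]
print(f"carrier bin ~{freqs[carrier_bin]:.1f} Hz, modulation peak ~{peak_hz:.2f} Hz")
```

In this sketch the spectrogram `spec` and the modulation map `mod` are both 2-dimensional arrays indexed by constant-Q bin, so either could be fed to a 2-dimensional convolutional network as an image-like input. How the paper combines the two, and which frame sizes and modulation ranges it uses, is not stated in the summary above.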
