Non-linear frequency warping using constant-Q transformation for speech emotion recognition
Premjeet Singh, Goutam Saha, Md Sahidullah

TL;DR
This paper investigates the constant-Q transform (CQT) for speech emotion recognition (SER). CQT-based features outperform traditional STFT-based features and generalize better across corpora, an advantage attributed to CQT's higher frequency resolution at low frequencies.
Contribution
The study explores CQT-based features for SER and shows that they outperform, and generalize better than, STFT-based features when a deep neural network (DNN) is used as the back-end classifier.
Findings
CQT-based features outperform STFT features in SER accuracy.
CQT features offer better generalization across different datasets.
CQT's higher frequency resolution at lower frequencies captures more emotion-related information.
Abstract
In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT makes it more promising for SER than standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with deep neural network (DNN) as a back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features for SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with…
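
The variable resolution described in the abstract can be illustrated numerically. The sketch below is not the authors' implementation; it uses the standard CQT definitions (geometrically spaced center frequencies and a constant ratio Q of center frequency to bandwidth) and compares the resulting bandwidths with the fixed bin spacing of an STFT. The minimum frequency, bins per octave, number of bins, sample rate, and FFT size are all hypothetical values chosen for illustration, and the function names are likewise made up here.

```python
import numpy as np

def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies: f_k = f_min * 2^(k / B)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

def cqt_bandwidths(freqs, bins_per_octave):
    """Bandwidth of each bin, delta_f_k = f_k / Q, with Q constant across bins."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return freqs / q

def stft_bin_spacing(sample_rate, n_fft):
    """An STFT has the same frequency spacing for every bin."""
    return sample_rate / n_fft

# Hypothetical settings, not taken from the paper.
f_min, bins_per_octave, n_bins = 32.7, 12, 84
sample_rate, n_fft = 16000, 512

freqs = cqt_center_frequencies(f_min, bins_per_octave, n_bins)
bandwidths = cqt_bandwidths(freqs, bins_per_octave)
stft_df = stft_bin_spacing(sample_rate, n_fft)

# Count the CQT bins whose bandwidth is narrower than the STFT bin spacing.
finer = bandwidths < stft_df
print(f"STFT bin spacing: {stft_df:.2f} Hz for every bin")
print(f"CQT bandwidth: {bandwidths[0]:.2f} Hz at {freqs[0]:.1f} Hz, "
      f"{bandwidths[-1]:.2f} Hz at {freqs[-1]:.1f} Hz")
print(f"CQT is finer than the STFT in {finer.sum()} of {n_bins} bins, "
      f"all below {freqs[finer][-1]:.1f} Hz")
```

Under these assumed settings, the CQT bins are narrower than the STFT bins only in the lower part of the spectrum and wider above it, which is the trade-off the abstract refers to: finer frequency resolution where emotion-related information is reported to concentrate, at the cost of coarser resolution at high frequencies.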
