Analysis of constant-Q filterbank based representations for speech emotion recognition
Premjeet Singh, Shefali Waldekar, Md Sahidullah, Goutam Saha

TL;DR
This paper investigates constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). It shows that their higher frequency resolution at low frequencies and their robustness to pitch variations lead to improved emotion classification performance.
Contribution
The paper analyzes constant-Q representations (CQT and CWT) for SER, highlights their benefits over mel-frequency spectral coefficients (MFSCs), and validates their effectiveness with deep neural networks.
Findings
Constant-Q features offer higher frequency resolution at low frequencies.
They are more robust to emotion-irrelevant pitch variations, especially for low-arousal emotions.
SER performance improves with constant-Q representations.
Abstract
This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. The time-domain comparative analysis between short-term mel-frequency spectral coefficients (MFSCs) and constant-Q filterbank-based features, namely the constant-Q transform (CQT) and the continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows better resolution of pitch harmonics in…
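To make the "higher frequency resolution at low frequencies" property concrete, here is a minimal sketch of how the bands of a geometrically spaced constant-Q filterbank can be laid out. It uses the standard constant-Q relation (every filter has the same ratio Q of center frequency to bandwidth). The function name `constant_q_bands` and the values of `f_min`, `bins_per_octave`, and `n_bins` are illustrative assumptions, not settings taken from the paper.

```python
def constant_q_bands(f_min=32.7, bins_per_octave=12, n_bins=48):
    """Return (center_frequency_hz, bandwidth_hz) pairs for a constant-Q filterbank.

    All default values are hypothetical and chosen only for illustration.
    """
    # Q = center frequency / bandwidth, identical for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bands = []
    for k in range(n_bins):
        # Center frequencies are geometrically spaced: one octave per
        # `bins_per_octave` bins.
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        # Bandwidth grows in proportion to the center frequency.
        bands.append((f_k, f_k / q))
    return bands


bands = constant_q_bands()
# Print the lowest, a middle, and the highest band for comparison.
for f_k, bw in (bands[0], bands[len(bands) // 2], bands[-1]):
    print(f"center {f_k:8.1f} Hz  bandwidth {bw:7.2f} Hz  Q {f_k / bw:.2f}")
```

Because the bandwidth scales with the center frequency, the lowest bands are the narrowest, which is where closely spaced pitch harmonics are easiest to separate. A fixed-window short-time Fourier analysis, by contrast, has the same bandwidth in every bin.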
