Deep scattering network for speech emotion recognition
Premjeet Singh, Goutam Saha, Md Sahidullah

TL;DR
This paper proposes using scattering transform features for speech emotion recognition, demonstrating improved robustness and accuracy over traditional features like MFCCs across multiple datasets.
Contribution
The paper introduces the scattering transform for SER, shows that it captures emotion cues while remaining invariant to emotion-irrelevant variations, and analyzes the scattering coefficients layer by layer to understand their contribution.
Findings
Frequency scattering outperforms time-domain scattering and MFCCs.
Layer-wise scattering coefficients perform better than MFCCs.
Scattering features are robust against speaker, language, and gender variations.
Abstract
This paper introduces the scattering transform for speech emotion recognition (SER). The scattering transform generates feature representations that remain stable to deformations and shifts in time and frequency without much loss of information. In speech, emotion cues are spread across time and localised in frequency. The time and frequency invariance of scattering coefficients provides a representation that is robust against emotion-irrelevant variations (e.g., different speakers, languages, and genders) while preserving the variations caused by emotion cues. Hence, such a representation captures the emotion information in speech more efficiently. We perform experiments to compare scattering coefficients with standard mel-frequency cepstral coefficients (MFCCs) over different databases. It is observed that frequency scattering performs better than time-domain scattering and…
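The shift-invariance property described in the abstract can be illustrated with a minimal NumPy sketch of first-order, time-domain scattering-style coefficients: band-pass filtering, a pointwise modulus, then averaging over time. This is not the authors' implementation. The filter bank (`gabor_bank`), its parameters, and the use of global averaging in place of a finite-width low-pass filter are all assumptions chosen for brevity, and the sketch omits higher layers and frequency scattering.

```python
import numpy as np

def gabor_bank(n, num_filters=8):
    """Hypothetical bank of Gaussian band-pass filters, defined in the Fourier domain."""
    freqs = np.fft.fftfreq(n)                      # normalized frequencies (cycles/sample)
    centers = 0.4 / 2.0 ** np.arange(num_filters)  # octave-spaced centre frequencies (assumed)
    # Bandwidth proportional to centre frequency, i.e. roughly constant-Q filters.
    return np.stack([np.exp(-0.5 * ((freqs - c) / (0.3 * c)) ** 2) for c in centers])

def first_order_scattering(x, bank):
    """Band-pass filter, take the modulus, then average over the whole signal."""
    X = np.fft.fft(x)
    # Circular convolution with every filter at once, followed by the modulus.
    u1 = np.abs(np.fft.ifft(X[None, :] * bank, axis=1))
    # Global time averaging: the extreme case of the low-pass averaging step.
    return u1.mean(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)          # stand-in for a speech frame
bank = gabor_bank(x.size)

s = first_order_scattering(x, bank)
s_shifted = first_order_scattering(np.roll(x, 100), bank)  # circularly shifted copy

print(np.allclose(s, s_shifted))
```

Because the filtering here is circular and the averaging is global, the coefficients of the shifted signal match those of the original up to floating-point error. With a finite-width averaging window, as in a practical scattering network, the invariance would hold only approximately and only for shifts smaller than the window.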
