TL;DR
SigWavNet introduces an end-to-end deep learning framework that combines wavelet transforms with neural networks to improve speech emotion recognition by capturing multi-resolution features directly from raw speech signals.
Contribution
The paper presents a novel wavelet-based deep learning model that learns wavelet bases and denoising jointly, enhancing feature extraction for speech emotion recognition without pre-processing.
Findings
Outperforms state-of-the-art methods on the IEMOCAP and EMO-DB datasets.
Effectively handles variable-length speech signals.
Eliminates the need for pre- or post-processing steps.
Abstract
In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet…
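To make the wavelet components named in the abstract concrete, here is a minimal sketch of a cascaded discrete wavelet transform whose high-pass filter is derived from the low-pass filter through the conjugate quadrature relation, followed by asymmetric hard thresholding of the detail coefficients. It is an illustration under stated assumptions, not the paper's implementation: the function names, the Haar filter, the circular boundary handling, and the fixed thresholds `t_pos` and `t_neg` are all hypothetical choices. In SigWavNet the filters and thresholds are learned, which presumably requires a differentiable form of the thresholding rather than the plain comparison used here.

```python
import math


def cqf_highpass(h):
    """Derive a high-pass filter g from a low-pass filter h using the
    conjugate quadrature relation g[n] = (-1)^n * h[L - 1 - n]."""
    L = len(h)
    return [((-1) ** n) * h[L - 1 - n] for n in range(L)]


def dwt_level(x, h, g):
    """One analysis level: filter with h and g (circular boundary,
    an assumption of this sketch) and keep every second output."""
    N = len(x)
    half = N // 2
    approx = [sum(h[k] * x[(2 * i + k) % N] for k in range(len(h)))
              for i in range(half)]
    detail = [sum(g[k] * x[(2 * i + k) % N] for k in range(len(g)))
              for i in range(half)]
    return approx, detail


def asymmetric_hard_threshold(coeffs, t_pos, t_neg):
    """Keep a coefficient only if it lies above t_pos or below -t_neg.
    Separate thresholds for the two signs make the rule asymmetric."""
    return [c if (c > t_pos or c < -t_neg) else 0.0 for c in coeffs]


def cascade(x, h, levels, t_pos, t_neg):
    """Cascade algorithm: repeatedly split the approximation band,
    thresholding the detail coefficients produced at each level.
    Assumes len(x) is divisible by 2 ** levels."""
    g = cqf_highpass(h)
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = dwt_level(approx, h, g)
        details.append(asymmetric_hard_threshold(d, t_pos, t_neg))
    return approx, details


if __name__ == "__main__":
    haar = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # stand-in for a learned filter
    approx, details = cascade([4.0, 4.0, 2.0, 0.0], haar, levels=2,
                              t_pos=0.5, t_neg=0.5)
    print(approx)   # single coarse coefficient, close to 5.0
    print(details)  # one list of detail coefficients per level
```

Because each level halves the signal length, the cascade yields progressively coarser views of the same waveform, which is the multi-resolution property the abstract refers to. A trainable version would treat the entries of `h` and the two thresholds as parameters updated by gradient descent.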
Taxonomy
Methods: Softmax · Attention Is All You Need
