Speech Emotion Recognition Using Quaternion Convolutional Neural Networks
Aneesh Muppidi, Martin Radfar

TL;DR
This paper introduces a quaternion CNN model for speech emotion recognition that encodes Mel-spectrogram features in an RGB quaternion domain. It outperforms real-valued methods and reaches state-of-the-art accuracy on RAVDESS, with results comparable to the state of the art on IEMOCAP and EMO-DB.
Contribution
The paper presents a novel quaternion CNN approach for SER that effectively encodes speech features and reduces model size while improving accuracy over real-valued methods.
Findings
Outperforms real-valued methods on RAVDESS dataset
Achieves state-of-the-art accuracy on RAVDESS (77.87%)
Results comparable to state-of-the-art methods on the IEMOCAP (4 classes) and EMO-DB (7 classes) datasets
Abstract
Although speech recognition has become a widespread technology, inferring emotion from speech signals remains a challenge. To address this problem, this paper proposes a quaternion convolutional neural network (QCNN) based speech emotion recognition (SER) model in which Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain. We show that our QCNN-based SER model outperforms other real-valued methods on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 8 classes) dataset, achieving, to the best of our knowledge, state-of-the-art results. The QCNN also achieves results comparable to state-of-the-art methods on the Interactive Emotional Dyadic Motion Capture (IEMOCAP, 4 classes) and Berlin EMO-DB (7 classes) datasets. Specifically, the model achieves an accuracy of 77.87%, 70.46%, and 88.78% for the RAVDESS, IEMOCAP, and…
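As a rough illustration of what encoding features "in an RGB quaternion domain" can look like, the sketch below maps each pixel of an RGB-rendered Mel-spectrogram to a pure quaternion 0 + R·i + G·j + B·k and applies the Hamilton product, the multiplication rule on which quaternion convolutions are built. The pure-quaternion mapping is an assumption borrowed from common quaternion image processing practice; the abstract does not spell out the exact encoding. The function names `rgb_to_quaternion` and `hamilton_product` are hypothetical and are not taken from the paper.

```python
import numpy as np

def rgb_to_quaternion(rgb_image):
    """Map an (H, W, 3) RGB array to (H, W, 4) pure quaternions.

    Assumed encoding: real part 0, imaginary parts (R, G, B).
    """
    rgb_image = np.asarray(rgb_image, dtype=np.float64)
    zeros = np.zeros(rgb_image.shape[:-1] + (1,))
    return np.concatenate([zeros, rgb_image], axis=-1)

def hamilton_product(p, q):
    """Hamilton product of quaternions stored as (..., 4) arrays [a, b, c, d]."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    ], axis=-1)

# Toy stand-in for an RGB-rendered Mel-spectrogram (2 x 3 pixels).
spectrogram_rgb = np.ones((2, 3, 3))
quaternion_pixels = rgb_to_quaternion(spectrogram_rgb)

# A single hypothetical quaternion weight, broadcast over every pixel.
weight = np.array([0.5, 0.1, -0.2, 0.3])
response = hamilton_product(weight, quaternion_pixels)
```

Because the Hamilton product mixes all four components of the weight with all four components of the input, one quaternion weight couples the three colour channels rather than treating them independently, which is the usual argument for why quaternion layers can use fewer parameters than real-valued ones.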
