Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction
Abdul Rehman, Zhen-Tao Liu, Min Wu, Wei-Hua Cao, and Cheng-Shan Jiang

TL;DR
This paper introduces a real-time speech emotion recognition system that decomposes audio into syllable-level features, enabling faster processing and improved cross-corpus accuracy compared to traditional deep learning methods.
Contribution
The paper proposes a novel syllable-level feature extraction approach combined with simple neural networks for real-time emotion recognition, enhancing speed and cross-corpus reliability.
Findings
Achieves real-time latency in emotion prediction.
Attains state-of-the-art cross-corpus accuracy of 47.6% and 56.2%.
Demonstrates robustness across multiple speech datasets.
Abstract
Speech emotion recognition systems have high prediction latency because of the high computational requirements for deep learning models and low generalizability mainly because of the poor reliability of emotional measurements across multiple corpora. To solve these problems, we present a speech emotion recognition system based on a reductionist approach of decomposing and analyzing syllable-level features. The Mel-spectrogram of an audio stream is decomposed into syllable-level components, which are then analyzed to extract statistical features. The proposed method uses formant attention, noise-gate filtering, and rolling normalization contexts to increase feature processing speed and tolerance to adversity. A set of syllable-level formant features is extracted and fed into a single hidden layer neural network that makes predictions for each syllable as opposed to the conventional approach…
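The decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not say how syllable boundaries are found, so the code below assumes a simple rule: after a noise gate zeroes out weak spectrogram bins, each run of consecutive non-silent frames is treated as one syllable-like segment. The function names (`noise_gate`, `split_segments`, `segment_features`, `extract`), the threshold value, and the "peak band" statistic (a crude stand-in for formant features) are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def noise_gate(frames, threshold):
    """Zero out spectrogram bins whose magnitude is below `threshold`."""
    return [[v if v >= threshold else 0.0 for v in frame] for frame in frames]


def split_segments(frames, min_len=2):
    """Group consecutive non-silent frames into syllable-like segments.

    Returns (start, end) frame-index pairs, with `end` exclusive.
    Runs shorter than `min_len` frames are discarded.
    """
    segments, start = [], None
    for i, frame in enumerate(frames):
        active = sum(frame) > 0.0
        if active and start is None:
            start = i                      # a new segment begins
        elif not active and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None                   # the segment has ended
    if start is not None and len(frames) - start >= min_len:
        segments.append((start, len(frames)))
    return segments


def segment_features(frames, segment):
    """Compute simple statistics for one segment (assumed feature set)."""
    start, end = segment
    chunk = frames[start:end]
    energies = [sum(f) for f in chunk]
    # Index of the strongest band in each frame: a rough formant proxy.
    peaks = [max(range(len(f)), key=f.__getitem__) for f in chunk]
    return {
        "duration": end - start,
        "energy_mean": mean(energies),
        "energy_std": pstdev(energies),
        "peak_band_mean": mean(peaks),
    }


def extract(frames, threshold=0.1):
    """Noise-gate a spectrogram, then return one feature dict per segment."""
    gated = noise_gate(frames, threshold)
    return [segment_features(gated, s) for s in split_segments(gated)]
```

Each feature dictionary could then be fed to a small classifier one segment at a time, which is the property the abstract credits for low latency: a prediction is available as soon as a segment ends rather than after the full utterance.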
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis
