Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Jianyou Wang; Michael Xue; Ryan Culhane; Enmao Diao; Jie Ding; Vahid Tarokh

arXiv:1910.08874 · eess.AS · July 24, 2020

TL;DR

This paper introduces a speech emotion recognition model that pairs a standard LSTM over MFCC features with a Dual-Sequence LSTM (DS-LSTM) over two mel-spectrograms computed at different time-frequency resolutions. It reports a 6% improvement over state-of-the-art unimodal models.
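The phrase "different resolutions" refers to the usual trade-off in short-time spectral analysis: a longer analysis window resolves frequency more finely but smears events in time, and a shorter one does the reverse. The sketch below only illustrates that arithmetic. The sample rate and window lengths are hypothetical values chosen for the example, not settings taken from the paper.

```python
def stft_resolution(sample_rate_hz, window_samples):
    """Return (window duration in ms, frequency bin spacing in Hz)."""
    duration_ms = 1000.0 * window_samples / sample_rate_hz
    bin_spacing_hz = sample_rate_hz / window_samples
    return duration_ms, bin_spacing_hz


# Hypothetical settings (assumptions, not from the paper).
SAMPLE_RATE_HZ = 16000
SHORT_WINDOW = 256    # finer in time, coarser in frequency
LONG_WINDOW = 1024    # coarser in time, finer in frequency

short_res = stft_resolution(SAMPLE_RATE_HZ, SHORT_WINDOW)
long_res = stft_resolution(SAMPLE_RATE_HZ, LONG_WINDOW)

print(short_res)  # (16.0, 62.5)
print(long_res)   # (64.0, 15.625)
```

Two spectrograms built this way describe the same utterance with complementary detail, which is the kind of input pair a two-sequence model can consume.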

Contribution

The paper presents a novel Dual-Sequence LSTM (DS-LSTM) architecture that processes two mel-spectrograms of different time-frequency resolutions simultaneously, alongside a standard LSTM over MFCC features, to improve emotion recognition accuracy.

Findings

1. Achieved, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%.

2. Outperformed current state-of-the-art unimodal models by 6%.

3. Performed comparably to multimodal models that also use textual information.
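The two accuracy figures measure different things. The sketch below assumes the convention common in speech emotion recognition work: weighted accuracy is overall accuracy across all utterances, and unweighted accuracy is the mean of per-class recall, so every emotion counts equally regardless of how many utterances it has. The page itself does not define the terms, and the labels and predictions here are made up for illustration.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced toy data (not from the paper).
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"]

print(weighted_accuracy(y_true, y_pred))    # 0.875
print(unweighted_accuracy(y_true, y_pred))  # 0.75
```

On imbalanced data the two numbers diverge, which is why both are usually reported together.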

Abstract

Speech Emotion Recognition (SER) has emerged as a critical component of next-generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual…
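The averaging step in the abstract can be sketched as a simple late fusion of two branch outputs. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say whether logits or probabilities are averaged, so the sketch averages softmax probabilities. The label set, the logit values, and the function names are hypothetical.

```python
import math

# Hypothetical label set (assumption; the abstract does not list classes).
EMOTIONS = ["happy", "angry", "sad", "neutral"]


def softmax(logits):
    """Convert raw scores to probabilities, shifted by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(branch_probs):
    """Element-wise mean of per-branch probability vectors."""
    n = len(branch_probs)
    return [sum(column) / n for column in zip(*branch_probs)]


# Made-up scores standing in for the MFCC branch (standard LSTM) and the
# mel-spectrogram branch (DS-LSTM) on one utterance.
mfcc_logits = [0.2, 1.5, 0.1, 1.4]
spec_logits = [0.1, 0.9, 0.2, 2.0]

fused = average_fusion([softmax(mfcc_logits), softmax(spec_logits)])
prediction = EMOTIONS[fused.index(max(fused))]
print(prediction)  # neutral
```

In this toy case the MFCC branch alone would pick "angry", while the fused output picks "neutral", showing how averaging lets a more confident branch outweigh a marginal call by the other.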

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory