Multi-Channel Auto-Encoder for Speech Emotion Recognition

Zefang Zong; Hao Li; Qi Wang

arXiv:1810.10662·cs.SD·October 26, 2018·5 cites

TL;DR

This paper introduces a multi-channel auto-encoder framework that leverages multiple local neural networks to improve speech emotion recognition accuracy from acoustic features, outperforming existing methods on a benchmark dataset.

Contribution

The paper proposes a novel multi-channel auto-encoder (MTC-AE) architecture that combines multiple local DNNs, each built on different low-level descriptors and statistics functions, to capture both local and global features for emotion recognition.

Findings

01

Achieved 64.8% leave-one-speaker-out unweighted accuracy on IEMOCAP, 2.4% higher than the previous best result on this dataset.

02

Demonstrated the effectiveness of combining multiple descriptors and statistics in emotion recognition.

03

Outperformed existing methods significantly on the IEMOCAP benchmark.
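
The findings report unweighted accuracy without defining it. In speech emotion recognition work the term usually means the average of per-class recalls, so that rare emotions count as much as frequent ones; that reading is an assumption here, not something this page states. A minimal sketch contrasting it with plain (weighted) accuracy, using made-up labels:

```python
from collections import defaultdict

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally, whatever its size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

def weighted_accuracy(y_true, y_pred):
    """Fraction of all utterances classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative labels only (not IEMOCAP data): a classifier that always
# predicts the majority class.
y_true = ["neutral"] * 4 + ["angry"]
y_pred = ["neutral"] * 5

print(weighted_accuracy(y_true, y_pred))    # 0.8
print(unweighted_accuracy(y_true, y_pred))  # 0.5
```

A majority-class predictor looks good under weighted accuracy but is penalised under the unweighted metric, which is why the latter is often preferred on datasets with imbalanced classes.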

Abstract

Inferring emotion status from users' queries plays an important role in enhancing the capabilities of voice dialogue applications. Even though several related works have obtained satisfactory results, performance can still be improved. In this paper, we propose a novel framework named multi-channel auto-encoder (MTC-AE) for emotion recognition from acoustic information. MTC-AE contains multiple local DNNs, based on different low-level descriptors with different statistics functions, that are partly concatenated together, which enables the structure to consider both local and global features simultaneously. Experiments on the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results, achieving $64.8\%$ leave-one-speaker-out unweighted accuracy, which is $2.4\%$ higher than the best result on this dataset.
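
The abstract describes the structure only at a high level. The sketch below is one possible reading of the encoder side, not the authors' implementation: each channel is a small local encoder fed with statistics of one low-level descriptor, and the local codes are concatenated and passed through a shared layer to form a global code. The descriptor names, the choice of statistics, the layer sizes, the one-channel-per-descriptor split, and the full (rather than partial) concatenation are all hypothetical. Decoding and training are omitted.

```python
import numpy as np

# Hypothetical setup: names, sizes, and the one-channel-per-descriptor split
# are illustrative assumptions, not values taken from the paper.
DESCRIPTORS = ["pitch", "energy", "mfcc_1"]
STATISTICS = [np.mean, np.std, np.max, np.min]
LOCAL_DIM = 2
GLOBAL_DIM = 4

def channel_inputs(frames):
    """Summarise each descriptor's frame sequence with every statistic."""
    return {
        name: np.array([stat(frames[:, i]) for stat in STATISTICS])
        for i, name in enumerate(DESCRIPTORS)
    }

def init_params(rng):
    """Random weights for each local encoder and for the shared global layer."""
    local = {
        name: (rng.normal(size=(LOCAL_DIM, len(STATISTICS))), np.zeros(LOCAL_DIM))
        for name in DESCRIPTORS
    }
    joint_dim = LOCAL_DIM * len(DESCRIPTORS)
    global_layer = (rng.normal(size=(GLOBAL_DIM, joint_dim)), np.zeros(GLOBAL_DIM))
    return {"local": local, "global": global_layer}

def encode(frames, params):
    """Return the per-channel (local) codes and the combined (global) code."""
    inputs = channel_inputs(frames)
    local_codes = []
    for name in DESCRIPTORS:
        W, b = params["local"][name]
        local_codes.append(np.tanh(W @ inputs[name] + b))
    # Concatenating the local codes lets the next layer see all channels at once.
    joint = np.concatenate(local_codes)
    W_g, b_g = params["global"]
    return local_codes, np.tanh(W_g @ joint + b_g)

rng = np.random.default_rng(0)
# Stand-in for one utterance: 120 frames, one column per descriptor.
frames = rng.normal(size=(120, len(DESCRIPTORS)))
local_codes, global_code = encode(frames, init_params(rng))
print([c.shape for c in local_codes], global_code.shape)
```

Keeping the local codes separate until the concatenation step is what lets a structure like this retain descriptor-specific (local) information while still producing a joint (global) representation.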

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Speech Recognition and Synthesis