Variational Autoencoders for Learning Latent Representations of Speech Emotion: A Preliminary Study
Siddique Latif, Rajib Rana, Junaid Qadir, Julien Epps

TL;DR
This paper explores using Variational Autoencoders to automatically learn features from speech signals for emotion recognition, achieving state-of-the-art results on the IEMOCAP dataset.
Contribution
It introduces the novel application of VAEs for speech emotion classification, demonstrating their effectiveness over handcrafted features.
Findings
VAE-derived features outperform traditional handcrafted features.
State-of-the-art classification accuracy achieved on IEMOCAP.
First use of VAEs in speech emotion recognition.
Abstract
Learning the latent representation of data in an unsupervised fashion is a very interesting process that provides relevant features for enhancing the performance of a classifier. For speech emotion recognition tasks, generating effective features is crucial. Currently, handcrafted features are mostly used for speech emotion recognition; however, features learned automatically using deep learning have shown strong success in many problems, especially in image processing. In particular, deep generative models such as Variational Autoencoders (VAEs) have gained enormous success in generating features for natural images. Inspired by this, we propose VAEs for deriving the latent representation of speech signals and use this representation to classify emotions. To the best of our knowledge, we are the first to propose VAEs for speech emotion classification. Evaluations on the IEMOCAP dataset…
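To make the idea of deriving a latent representation with a VAE concrete, here is a minimal NumPy sketch of the generic VAE forward pass and loss. It is not the authors' model: the abstract does not specify the architecture, input features, or training setup, so the linear encoder and decoder, the 40-dimensional placeholder input, the latent size of 16, and all function names below are assumptions made purely for illustration.

```python
import numpy as np

def encode(x, W_mu, W_logvar):
    """Hypothetical linear encoder: returns mean and log-variance of q(z|x)."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) per example for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

def negative_elbo(x, x_recon, mu, logvar):
    """Squared-error reconstruction term plus KL term, averaged over the batch."""
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return np.mean(recon + kl_to_standard_normal(mu, logvar))

rng = np.random.default_rng(0)

# Placeholder data: 8 examples with 40 features each (a stand-in for speech features).
x = rng.standard_normal((8, 40))
latent_dim = 16  # assumed latent size, not taken from the paper

# Randomly initialised weights; a real model would learn these by minimising the loss.
W_mu = 0.1 * rng.standard_normal((40, latent_dim))
W_logvar = 0.1 * rng.standard_normal((40, latent_dim))
W_dec = 0.1 * rng.standard_normal((latent_dim, 40))

mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
x_recon = z @ W_dec  # hypothetical linear decoder
loss = negative_elbo(x, x_recon, mu, logvar)

# One common choice: use the latent means as features for a downstream classifier.
features = mu
```

In a setup like this, the `features` array would be passed to a separate emotion classifier once the VAE has been trained; which classifier the paper uses is not stated in the text above.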
