Towards Transferable Speech Emotion Representation: On loss functions for cross-lingual latent representations

Sneha Das; Nicole Nadine Lønfeldt; Anne Katrine Pagsberg; Line H. Clemmensen

arXiv:2203.14865·eess.AS·March 29, 2022·1 cites

TL;DR

This paper investigates loss functions that improve transferability of speech emotion recognition models across languages, proposing VAE-based methods to achieve consistent latent representations especially for non-tonal languages.

Contribution

The paper proposes a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to improve cross-lingual transferability in speech emotion recognition.
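As a rough illustration of what KL annealing means in a VAE objective, the sketch below ramps the weight on the KL term from 0 to 1 during training. It is a minimal sketch, not the authors' implementation: the linear schedule, the squared-error reconstruction term, the standard-normal prior, and the names `kl_weight`, `gaussian_kl`, `annealed_vae_loss`, and `warmup_steps` are all assumptions made for illustration.

```python
import numpy as np

def kl_weight(step, warmup_steps=1000):
    """Linear KL annealing (assumed schedule): weight rises from 0 to 1."""
    return min(1.0, step / warmup_steps)

def gaussian_kl(mu, log_var):
    """Per-sample KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def annealed_vae_loss(x, x_hat, mu, log_var, step, warmup_steps=1000):
    """Mean of reconstruction error plus the annealed KL term over a batch."""
    recon = np.sum((x - x_hat) ** 2, axis=1)   # squared error (assumed likelihood)
    kl = gaussian_kl(mu, log_var)
    beta = kl_weight(step, warmup_steps)       # 0 early in training, 1 after warm-up
    return float(np.mean(recon + beta * kl))
```

Early in training the loss behaves like a plain autoencoder objective, and the pull towards the prior is introduced gradually. This is a common way to keep the latent code from being ignored before the decoder has learned anything useful.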

Findings

01 Semi-supervised VAE achieves comparable accuracy to DAE.

02 VAE methods produce more consistent latent embeddings across datasets.

03 Denoising autoencoder achieves over 52% accuracy in four-class emotion classification.

Abstract

In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques, which provide transfer learning possibilities. However, generalizing over languages, corpora and recording conditions is still an open challenge. In this work we address this gap by exploring loss functions that aid in transferability, specifically to non-tonal languages. We propose a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to obtain more consistent latent embedding distributions across data sets. To ensure transferability, the distribution of the latent embedding should be similar across non-tonal languages (data sets). We start by presenting a low-complexity SER based on a denoising-autoencoder, which achieves an…

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing