ASR-based Features for Emotion Recognition: A Transfer Learning Approach

Noé Tits; Kevin El Haddad; Thierry Dutoit

arXiv:1805.09197·eess.AS·June 4, 2018

TL;DR

This paper uses a neural ASR system as a feature extractor for emotion recognition, a form of transfer learning, and shows that the ASR-derived features outperform the eGeMAPS feature set in predicting the valence and arousal dimensions from spontaneous speech.

Contribution

It uses a neural ASR model as a feature extractor for emotion recognition and examines how representations from the first layers (closer to speech) and the last layers (closer to text) relate to valence and arousal.

Findings

1. ASR features outperform eGeMAPS in valence and arousal prediction

2. First and last layers of the ASR have different relevance to the emotional dimensions

3. ASR-based features encode emotional information in spontaneous speech

Abstract

During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and the last layers (closer to text) of the ASR and valence/arousal.
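The abstract does not say how layer activations are turned into utterance-level features or which regressor is used, so the following is only a minimal sketch of the general idea under stated assumptions: frame-level activations from one ASR layer are pooled over time (mean and standard deviation, an assumed choice) and fed to a ridge regressor (also an assumed choice) that predicts valence and arousal. The activations and labels here are random stand-ins, and `pool_utterance`, `fit_ridge`, and `predict` are hypothetical helper names, not part of any ASR toolkit.

```python
import numpy as np


def pool_utterance(frame_activations: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, num_units) activation matrix into one vector
    by concatenating the per-unit mean and standard deviation over time."""
    return np.concatenate(
        [frame_activations.mean(axis=0), frame_activations.std(axis=0)]
    )


def fit_ridge(features: np.ndarray, targets: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression with an unpenalized bias term."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    penalty = alpha * np.eye(design.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias column unregularized
    return np.linalg.solve(design.T @ design + penalty, design.T @ targets)


def predict(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Apply the fitted weights (bias in the last row) to pooled features."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return design @ weights


# Synthetic stand-in data (assumption): in practice each array would hold the
# activations of one chosen ASR layer for one utterance.
rng = np.random.default_rng(0)
num_utterances, num_units = 40, 8
utterances = [
    rng.normal(size=(int(rng.integers(20, 60)), num_units))
    for _ in range(num_utterances)
]

features = np.stack([pool_utterance(u) for u in utterances])  # (40, 16)
labels = rng.uniform(-1.0, 1.0, size=(num_utterances, 2))     # valence, arousal

weights = fit_ridge(features, labels, alpha=1.0)
predictions = predict(weights, features)
```

To compare layers in this setup, one would repeat the same pooling and regression with activations taken from an early layer and from a late layer, then score each on held-out utterances.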
