ASR-based Features for Emotion Recognition: A Transfer Learning Approach
Noé Tits, Kevin El Haddad, Thierry Dutoit

TL;DR
This paper explores using features from a neural ASR system as a form of transfer learning for emotion recognition, showing that ASR-derived features outperform the eGeMAPS feature set in predicting the valence and arousal dimensions from spontaneous speech.
Contribution
It introduces a novel approach of leveraging neural ASR models as feature extractors for emotion recognition, highlighting the importance of different layer representations.
Findings
ASR features outperform eGeMAPS in valence and arousal prediction
First and last layers of ASR have different relevance to emotional dimensions
ASR-based features encode emotional information in spontaneous speech
Abstract
During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and last layers (closer to text) of the ASR system and valence/arousal.
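As a rough illustration of the idea in the abstract, the sketch below treats a network's intermediate activations as features and fits one regressor per layer on valence/arousal labels. Everything in it is an assumption made for illustration: the "ASR" is a hypothetical stack of random dense layers (not the model used in the paper), the utterances and labels are synthetic, and mean pooling over time plus ridge regression are stand-ins for whatever pooling and predictor the authors actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_asr_layers(frames, weights):
    """Pass frames (T, D) through dense+ReLU layers and return each layer's activations."""
    activations = []
    h = frames
    for w in weights:
        h = np.maximum(h @ w, 0.0)
        activations.append(h)
    return activations

def utterance_features(frames, weights, layer):
    """Mean-pool one layer's frame-level activations into a fixed-size utterance vector."""
    return toy_asr_layers(frames, weights)[layer].mean(axis=0)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; a constant column acts as the (also regularized) bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(X, coef):
    """Apply the fitted ridge coefficients to new feature vectors."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ coef

# Hypothetical stand-in for a pretrained ASR: three random layers on 40-dim input frames.
weights = [
    rng.normal(size=(40, 32)) * 0.1,
    rng.normal(size=(32, 32)) * 0.1,
    rng.normal(size=(32, 16)) * 0.1,
]

# Synthetic utterances of varying length and synthetic (valence, arousal) labels.
utterances = [rng.normal(size=(int(rng.integers(50, 100)), 40)) for _ in range(20)]
labels = rng.uniform(-1.0, 1.0, size=(20, 2))

# Fit one regressor per layer so that early and late layers can be compared.
for layer in range(len(weights)):
    X = np.stack([utterance_features(u, weights, layer) for u in utterances])
    coef = fit_ridge(X, labels)
    pred = predict(X, coef)
    print(f"layer {layer}: features {X.shape}, predictions {pred.shape}")
```

With a real pretrained ASR model and annotated speech, comparing held-out prediction quality across layers would be one way to probe how early (closer to speech) and late (closer to text) representations relate to valence and arousal; the numbers produced by this toy setup carry no such meaning.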
