Decoding Emotions: A comprehensive Multilingual Study of Speech Models for Speech Emotion Recognition
Anant Singh, Akshat Gupta

TL;DR
This study benchmarks eight transformer-based speech representation models for speech emotion recognition (SER) across six languages. It finds that features from a single, well-chosen layer give lower error rates than systems that use features from all layers, and it reports state-of-the-art results for German and Persian.
Contribution
The paper provides a comprehensive multilingual benchmark for speech emotion recognition, along with probing results that show which model layers best capture emotional information.
Findings
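Features from a single optimal layer reduce the error rate by 32% on average across seven datasets, compared with systems that use features from all layers.
State-of-the-art results are achieved for German and Persian.
The middle layers of the speech models carry the most emotional information.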
Abstract
Recent advancements in transformer-based speech representation models have greatly transformed speech processing. However, little research has evaluated these models for speech emotion recognition (SER) across multiple languages or examined their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conducted probing experiments to gain insights into the inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets, compared with systems that use features from all layers of speech models. We also achieve state-of-the-art results for German and Persian. Our probing results indicate that the middle layers of…
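The layer-wise probing idea described in the abstract can be illustrated with a small toy sketch. It is not the authors' pipeline: the per-layer features are synthetic stand-ins for pooled hidden states of a speech model, the nearest-centroid probe is an assumed classifier chosen for brevity, and the names make_layer_features, probe_accuracy, best_layer, and the strengths values are all hypothetical. The sketch only shows the selection procedure: train one probe per layer, score each on held-out data, and keep the layer with the highest accuracy.

```python
import random


def make_layer_features(num_utts, strengths, dim, rng):
    """Build synthetic per-layer features (toy data, not real speech embeddings).

    features[l][i] is the vector for utterance i at layer l. strengths[l] sets
    how strongly layer l encodes a binary emotion label.
    """
    labels = [i % 2 for i in range(num_utts)]  # balanced binary labels
    features = []
    for s in strengths:
        layer = []
        for y in labels:
            sign = 1.0 if y == 1 else -1.0
            layer.append([s * sign + rng.gauss(0.0, 1.0) for _ in range(dim)])
        features.append(layer)
    return features, labels


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Nearest-centroid probe: fit class centroids, score on held-out data."""
    classes = sorted(set(train_y))
    cents = {
        c: centroid([x for x, y in zip(train_x, train_y) if y == c])
        for c in classes
    }
    correct = sum(
        min(classes, key=lambda c: sq_dist(x, cents[c])) == y
        for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


def best_layer(features, labels, split):
    """Probe every layer separately and return (best layer index, accuracies)."""
    accs = [
        probe_accuracy(layer[:split], labels[:split], layer[split:], labels[split:])
        for layer in features
    ]
    best = max(range(len(accs)), key=accs.__getitem__)
    return best, accs


rng = random.Random(0)
# Toy assumption: the middle layer carries the strongest label signal.
strengths = [0.0, 0.3, 3.0, 0.3, 0.0]
features, labels = make_layer_features(200, strengths, dim=4, rng=rng)
layer_idx, accs = best_layer(features, labels, split=100)
print(layer_idx, [round(a, 2) for a in accs])
```

With a real model, the synthetic features would be replaced by time-pooled hidden states taken from each transformer layer, and the layer would be chosen on a validation split rather than on the test data.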
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and Audio Processing
