Charting 15 years of progress in deep learning for speech emotion recognition: A replication study

Andreas Triantafyllopoulos; Anton Batliner; Björn W. Schuller

arXiv:2508.02448·cs.SD·August 5, 2025

Open Access

TL;DR

This replication study quantifies 15 years of deep learning progress in speech emotion recognition, finding diminishing returns from newer models and showing that perceived progress depends on which models are compared.

Contribution

It provides a comprehensive quantification of progress in SER over 15 years and analyzes the impact of different model architectures, highlighting the plateau after transformer models.

Findings

01 Diminishing returns observed with recent deep learning models

02 Plateau in progress after transformer architectures

03 Perceptions of progress depend on model comparison choices

Abstract

Speech emotion recognition (SER) has long benefited from the adoption of deep learning methodologies. Deeper models, with more layers and more trainable parameters, are generally perceived as being 'better' by the SER community. This raises the question: how much better are modern-era deep neural networks compared to their earlier iterations? Beyond that, the more important question of how to move forward remains as poignant as ever. SER is far from a solved problem; therefore, identifying the most prominent avenues of future research is of paramount importance. In the present contribution, we attempt a quantification of progress in the 15 years of research beginning with the introduction of the landmark 2009 INTERSPEECH Emotion Challenge. We conduct a large-scale investigation of model architectures, spanning both audio-based models that rely on speech inputs and text-based…

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining