How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Marc Casals-Salvador; Federico Costa; Rodolfo Zevallos; Javier Hernando

arXiv:2603.15120·eess.AS·March 17, 2026

Open Access

TL;DR

This paper benchmarks optimized attention mechanisms for Speech Emotion Recognition (SER), including RetNet, LightNet, GSA, FoX, and KDA, and characterizes the trade-off between accuracy and efficiency to inform scalable SER system design.

Contribution

It systematically compares efficient attention variants against standard self-attention for SER on both MSP-Podcast benchmark versions, showing that the efficient variants scale better in inference latency and memory usage.

Findings

01

Standard self-attention achieves the strongest recognition performance across test sets.

02

Efficient attention variants reduce inference latency and memory usage by up to an order of magnitude.

03

Trade-offs exist between recognition performance and computational efficiency.

Abstract

Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both MSP-Podcast benchmark versions show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency,…
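The scalability argument in the abstract can be illustrated with a minimal sketch. It contrasts standard softmax self-attention, which materializes a T x T score matrix, with a generic causal kernelized linear attention that carries only a fixed-size state across time steps. This is an assumption-laden illustration, not one of the benchmarked mechanisms: the abstract does not describe the internals of RetNet, LightNet, GSA, FoX, or KDA, and the function names, the `elu(x) + 1` feature map, and the element-count helper are all hypothetical choices made here for clarity.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard (non-causal) scaled dot-product self-attention.

    Builds a full (T, T) score matrix, so compute and memory grow as O(T^2).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (T, T)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (T, d_v)


def feature_map(x):
    """elu(x) + 1, a positive feature map (illustrative assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_recurrent(q, k, v):
    """Generic causal linear attention computed as a recurrence.

    The running state has shape (d, d_v) regardless of sequence length T,
    so per-step memory is constant and total compute is linear in T.
    """
    T, d = q.shape
    d_v = v.shape[-1]
    fq, fk = feature_map(q), feature_map(k)
    state = np.zeros((d, d_v))   # running sum of outer(phi(k_t), v_t)
    norm = np.zeros(d)           # running sum of phi(k_t)
    out = np.empty((T, d_v))
    for t in range(T):
        state += np.outer(fk[t], v[t])
        norm += fk[t]
        out[t] = (fq[t] @ state) / (fq[t] @ norm)
    return out


def attention_memory_elements(T, d, d_v):
    """Count the floats held by each approach's dominant intermediate."""
    return {"softmax_scores": T * T, "linear_state": d * d_v + d}
```

Under these assumptions, a 1000-frame utterance with 64-dimensional heads needs a million score entries for softmax attention but only a few thousand state entries for the recurrent form, which is the kind of gap behind the latency and memory reductions the abstract reports. Actual savings depend on the specific mechanism and implementation.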

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing