Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization

Tahitoa Leygue (DIASI (CEA; LIST)); Astrid Sabourin (DIASI (CEA; LIST)); Christian Bolzmacher (DIASI (CEA; LIST)); Sylvain Bouchigny (DIASI (CEA; LIST)); Margarita Anastassova (DIASI (CEA; LIST)); Quoc-Cuong Pham (DIASI (CEA; LIST))

arXiv:2506.15754 · cs.SD · June 23, 2025

Open Access

TL;DR

This paper investigates attention-based pooling methods for Speech Emotion Recognition, demonstrating that attentive pooling improves performance, localizes emotion cues effectively, and aligns with human perceptual strategies, with promising results on naturalistic data.

Contribution

The paper systematically benchmarks pooling strategies, introduces an attentive pooling method that enhances SER performance, and provides insights into emotion cue localization and interpretability.

Findings

01

Attentive pooling (Multi-Query Multi-Head Attentive Statistics Pooling) achieves a 3.5 percentage point macro F1 gain over average pooling.

02

Attention analysis shows that 15% of frames capture 80% of emotion cues, so emotional information is temporally localized.

03

High-attention frames often include non-linguistic vocalizations and hyperarticulated phonemes.
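
The concentration statistic in finding 02 can be illustrated with a small sketch. It assumes that "emotion cues" are measured as cumulative attention mass, which this page does not confirm. The function name `fraction_of_frames_for_mass` and the toy weights are hypothetical, and the code is not taken from the paper.

```python
def fraction_of_frames_for_mass(attention, mass=0.8):
    """Smallest fraction of frames whose attention sums to at least `mass`."""
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    # Normalize so the weights sum to 1, then rank frames from most to least attended.
    ranked = sorted((w / total for w in attention), reverse=True)
    covered = 0.0
    for count, weight in enumerate(ranked, start=1):
        covered += weight
        if covered >= mass - 1e-12:  # small tolerance for float rounding
            return count / len(ranked)
    return 1.0


# Hypothetical attention weights for a 10-frame utterance (illustrative only).
toy_attention = [0.45, 0.36, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02, 0.02]
print(fraction_of_frames_for_mass(toy_attention))
```

With uniform attention the function returns roughly the requested mass itself (80% of frames for 80% of mass), so a value far below that, such as the 15% reported here, indicates strongly localized attention.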

Abstract

State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro F1 gain over average pooling. Attention analysis shows 15 percent of frames capture 80 percent of emotion cues, revealing a localized pattern of emotional information. Analysis of high-attention frames reveals that non-linguistic vocalizations and hyperarticulated phonemes are disproportionately prioritized during pooling, mirroring human perceptual strategies. Our findings position attentive pooling as both a performant SER mechanism and a biologically plausible tool for explainable emotion localization. On Interspeech 2025 Speech Emotion Recognition in Naturalistic…
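
To make the contrast between average pooling and attentive pooling concrete, here is a minimal sketch of single-query, single-head attentive statistics pooling in plain Python. It is a simplification for illustration, not the Multi-Query Multi-Head variant benchmarked in the paper. The dot-product scoring, the `query` vector, and the function names are assumptions; in a trained model the query would be learned.

```python
import math


def average_pooling(frames):
    """Unweighted mean over time for each feature dimension."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def attentive_statistics_pooling(frames, query):
    """Pool (T x D) frame features into a 2*D vector of weighted mean and std.

    Simplified single-query, single-head form; `query` is a hypothetical
    scoring vector of length D.
    """
    # One scalar score per frame (dot product with the query).
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted mean and standard deviation per feature dimension.
    dims = range(len(frames[0]))
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in dims]
    std = [
        math.sqrt(sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames)))
        for d in dims
    ]
    return mean + std, weights


# Toy input: 3 frames with 2 features each (illustrative values only).
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled, weights = attentive_statistics_pooling(toy_frames, query=[10.0, 0.0])
```

With an all-zero query the softmax weights are uniform and the mean half of the output matches average pooling. A query that scores some frames highly concentrates the weights on them, and those per-frame weights are what an attention analysis inspects.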

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining