Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization
Tahitoa Leygue (DIASI (CEA, LIST)), Astrid Sabourin (DIASI (CEA, LIST)), Christian Bolzmacher (DIASI (CEA, LIST)), Sylvain Bouchigny (DIASI (CEA, LIST)), Margarita Anastassova (DIASI (CEA, LIST)), Quoc-Cuong Pham (DIASI (CEA, LIST))

TL;DR
This paper investigates attention-based pooling methods for Speech Emotion Recognition, demonstrating that attentive pooling improves performance, localizes emotion cues effectively, and aligns with human perceptual strategies, with promising results on naturalistic data.
Contribution
It systematically benchmarks pooling strategies for SER, shows that Multi-Query Multi-Head Attentive Statistics Pooling outperforms average pooling, and uses the resulting attention weights to study where emotion cues are localized and how the model's decisions can be interpreted.
Findings
Multi-Query Multi-Head Attentive Statistics Pooling achieves a 3.5 percentage point macro F1 gain over average pooling.
Attention analysis shows that 15% of frames capture 80% of emotion cues, so emotional information is temporally localized.
High-attention frames often include non-linguistic vocalizations and hyperarticulated phonemes.
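The 15%/80% finding can be read as a statement about how concentrated the attention weights are. Below is a minimal sketch of one way to measure that concentration, assuming per-frame attention weights are available as a 1-D array. The function name `frames_needed_for_mass` and the cumulative-mass criterion are illustrative assumptions, not the paper's documented procedure.

```python
import numpy as np

def frames_needed_for_mass(attn, mass=0.8):
    """Return the smallest fraction of frames whose attention weights
    sum to at least `mass` (hypothetical helper, not from the paper)."""
    w = np.asarray(attn, dtype=float)
    w = w / w.sum()                    # normalize so the weights sum to 1
    sorted_w = np.sort(w)[::-1]        # largest weights first
    cum = np.cumsum(sorted_w)          # cumulative attention mass
    # First position where the cumulative mass reaches the target
    # (small tolerance guards against floating-point rounding).
    k = int(np.searchsorted(cum, mass - 1e-12)) + 1
    k = min(k, w.size)
    return k / w.size
```

Under this reading, a value near 0.15 for `mass=0.8` would correspond to the reported pattern, while uniform attention would give a value near 0.8.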
Abstract
State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro F1 gain over average pooling. Attention analysis shows 15 percent of frames capture 80 percent of emotion cues, revealing a localized pattern of emotional information. Analysis of high-attention frames reveals that non-linguistic vocalizations and hyperarticulated phonemes are disproportionately prioritized during pooling, mirroring human perceptual strategies. Our findings position attentive pooling as both a performant SER mechanism and a biologically plausible tool for explainable emotion localization. On Interspeech 2025 Speech Emotion Recognition in Naturalistic…
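To make the pooling step concrete, here is a minimal NumPy sketch of attentive statistics pooling in its simplest single-query, single-head form. It is a simplification of the Multi-Query Multi-Head variant named in the abstract, not the authors' implementation. The parameter names `W`, `b`, `v` and the tanh scoring function are assumptions chosen for illustration.

```python
import numpy as np

def attentive_stats_pooling(frames, W, b, v):
    """Single-head attentive statistics pooling (illustrative sketch).

    frames: (T, D) frame-level features
    W: (D, H), b: (H,), v: (H,)  hypothetical attention parameters
    Returns a (2*D,) utterance vector and the (T,) attention weights.
    """
    scores = np.tanh(frames @ W + b) @ v       # one scalar score per frame
    scores = scores - scores.max()             # stabilize the softmax
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()                # attention weights sum to 1
    mean = alpha @ frames                      # attention-weighted mean, (D,)
    var = alpha @ (frames ** 2) - mean ** 2    # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-8, None))    # clip to avoid sqrt of negatives
    return np.concatenate([mean, std]), alpha
```

When all scores are equal, the weights are uniform and the mean half of the output reduces to average pooling, which is the baseline the abstract compares against. The returned `alpha` is the quantity an attention analysis would inspect to locate high-attention frames.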
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining
