Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods
Ali Shendabadi, Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan

TL;DR
This paper investigates OpenAI's Whisper representations combined with two attention-based pooling methods for speech emotion recognition, achieving state-of-the-art results on the Persian ShEMO dataset and showing that Whisper features are effective for both English and Persian.
Contribution
The paper introduces two attention-based pooling techniques for Whisper features, Multi-head Attentive Average Pooling and QKV Pooling, and demonstrates that they improve speech emotion recognition performance.
Findings
State-of-the-art unweighted accuracy on the ShEMO dataset with multi-head QKV pooling (a 2.47% improvement).
Intermediate Whisper layers often outperform final layers for SER.
The Whisper-based approach is an efficient alternative to larger models such as HuBERT X-Large.
Abstract
Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that…
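The truncated abstract does not spell out how either pooling method is defined, so the following is only a minimal, generic sketch of attention-weighted pooling over time, not the authors' implementation. It assumes a frame sequence of shape T x D (such as an encoder output), one learnable query vector per head, and an even split of the feature dimension across heads. The function names and the scaling by the square root of the head dimension are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(frames, query):
    """Collapse T frames (each of dimension D) into a single D-dim vector.

    Each frame is scored against `query` (a hypothetical learned vector),
    the scores are normalized over time, and the frames are averaged
    with those weights.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(x * q for x, q in zip(frame, query)) / scale
        for frame in frames
    ]
    weights = softmax(scores)
    dim = len(frames[0])
    return [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]


def multi_head_attentive_pool(frames, queries):
    """Pool each slice of the feature dimension with its own query.

    Assumption: the D features are split evenly into len(queries) heads,
    and the pooled slices are concatenated back into one D-dim vector.
    """
    num_heads = len(queries)
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by head count")
    head_dim = dim // num_heads
    pooled = []
    for h, query in enumerate(queries):
        chunk = [frame[h * head_dim:(h + 1) * head_dim] for frame in frames]
        pooled.extend(attentive_pool(chunk, query))
    return pooled


if __name__ == "__main__":
    # Toy input: 3 time steps, 4 features, 2 heads of 2 features each.
    toy_frames = [[1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0],
                  [9.0, 10.0, 11.0, 12.0]]
    toy_queries = [[0.0, 0.0], [0.5, -0.5]]
    print(multi_head_attentive_pool(toy_frames, toy_queries))
```

With an all-zero query the attention weights are uniform, so the pooling reduces to plain mean pooling. A trained query instead shifts weight toward the frames it scores highest, which is the general idea behind reducing a variable-length sequence to a fixed-size utterance vector while keeping the most informative frames.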
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis
