Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features

Jonghwan Hyeon; Yung-Hwan Oh; Ho-Jin Choi

arXiv:2410.12416·cs.SD·October 17, 2024

Open Access

TL;DR

This paper introduces Segmental Average Pooling (SAP), which improves Speech Emotion Recognition by pooling self-supervised learning features over speech segments while ignoring non-speech segments, and reports that it outperforms traditional global average pooling.

Contribution

The paper proposes a novel Segmental Average Pooling method that enhances SSL-based SER by selectively emphasizing speech segments over non-speech segments.

Findings

01. SAP improves SER accuracy on IEMOCAP and KEMDy19 datasets.

02. The combined use of GAP and SAP yields state-of-the-art results.

03. SAP outperforms traditional global average pooling in speech emotion tasks.

Abstract

Speech Emotion Recognition (SER) analyzes human emotions expressed through speech. Self-supervised learning (SSL) offers a promising approach to SER by learning meaningful representations from a large amount of unlabeled audio data. However, existing SSL-based methods rely on Global Average Pooling (GAP) to represent audio signals, treating speech and non-speech segments equally. This can lead to dilution of informative speech features by irrelevant non-speech information. To address this, the paper proposes Segmental Average Pooling (SAP), which selectively focuses on informative speech segments while ignoring non-speech segments. By applying both GAP and SAP to SSL features, our approach utilizes overall speech signal information from GAP and specific information from SAP, leading to improved SER performance. Experiments show state-of-the-art results on the IEMOCAP for English and…
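
The contrast between GAP and SAP described in the abstract can be illustrated with a minimal sketch. The abstract does not say how speech segments are detected or exactly how SAP aggregates them, so the code below makes loud assumptions: a per-frame boolean `speech_mask` (for example from a voice activity detector) is given, SAP is taken to be a plain mean over the frames marked as speech, and the two pooled vectors are combined by concatenation. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def global_average_pooling(features):
    """Average over all frames, speech and non-speech alike."""
    return features.mean(axis=0)


def segmental_average_pooling(features, speech_mask):
    """Average over frames flagged as speech only.

    Assumption: `speech_mask` is a boolean array with one entry per frame.
    The paper's actual segment detection and aggregation may differ.
    """
    speech_mask = np.asarray(speech_mask, dtype=bool)
    if not speech_mask.any():
        # No speech detected: fall back to the global average (an assumption).
        return features.mean(axis=0)
    return features[speech_mask].mean(axis=0)


def pooled_representation(features, speech_mask):
    """Combine both views; concatenation is an assumed way to combine them."""
    gap = global_average_pooling(features)
    sap = segmental_average_pooling(features, speech_mask)
    return np.concatenate([gap, sap])


# Toy SSL features: 4 frames, 2 dimensions; the first and last frames are silence.
features = np.array([[0.0, 0.0],
                     [2.0, 4.0],
                     [4.0, 8.0],
                     [0.0, 0.0]])
speech_mask = np.array([False, True, True, False])

utterance_vector = pooled_representation(features, speech_mask)
print(utterance_vector)
```

In this toy case the silent frames pull the GAP half of the vector toward zero, while the SAP half reflects only the two speech frames, which is the dilution effect the abstract describes.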

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition

Methods: Average Pooling · Global Average Pooling