Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition

Shuiyang Mao; P. C. Ching; C.-C. Jay Kuo; Tan Lee

arXiv:2008.06667·eess.AS·August 18, 2020·1 cites

Open Access

TL;DR

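This paper introduces an attention-based multiple instance learning (MIL) framework for categorical speech emotion recognition. The framework extracts segment-level emotional cues from utterances that are labeled only at the utterance level, and it reports accuracy competitive with or better than state-of-the-art methods on the CASIA and IEMOCAP datasets.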

Contribution

It proposes a novel combination of MIL and attention mechanisms to improve categorical speech emotion recognition from weakly labeled data.

Findings

01. Outperforms existing methods on CASIA and IEMOCAP datasets.

02. Effectively identifies salient emotional segments within utterances.

03. Achieves competitive or superior accuracy compared to state-of-the-art approaches.

Abstract

Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., determining the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most existing emotion corpora do not give ground truth labels for each segment; instead, we only have labels for whole utterances. To extract segment-level emotional information from such weakly labeled emotion corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Also, for a sufficiently long utterance, not all of the segments contain relevant emotional information. In this regard, three attention-based neural network models are then applied to the learned segment embeddings to attend to the most salient part of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show better…
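The idea of attending to the most salient segments can be illustrated with a minimal sketch. It is not one of the paper's three attention models, whose details are not given in this abstract; it assumes plain dot-product attention, and the names `attention_pool` and `query` and the toy numbers are hypothetical. Each segment embedding is scored against a query vector, the scores are normalized with a softmax, and the weighted sum gives a single utterance-level embedding.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    segments: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool segment embeddings into one utterance embedding.

    Each segment is scored by its dot product with `query` (a hypothetical
    vector that would be learned in practice); the softmax of the scores
    gives the attention weights used for the weighted sum.
    """
    scores = [sum(x * q for x, q in zip(seg, query)) for seg in segments]
    weights = softmax(scores)
    dim = len(segments[0])
    pooled = [
        sum(w * seg[d] for w, seg in zip(weights, segments))
        for d in range(dim)
    ]
    return pooled, weights


# Toy utterance: three segments with 2-D embeddings (made-up numbers).
segments = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]  # assumption: stands in for a learned attention vector
pooled, weights = attention_pool(segments, query)
```

In this toy case the second segment has the largest score, so it receives the largest attention weight and dominates the pooled embedding. In a trained model, an utterance-level classifier would take `pooled` as input, so only the utterance label is needed for supervision.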

Taxonomy

Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis