Advancing Multiple Instance Learning with Attention Modeling for Categorical Speech Emotion Recognition
Shuiyang Mao, P. C. Ching, C.-C. Jay Kuo, Tan Lee

TL;DR
This paper introduces an attention-based multiple instance learning (MIL) framework for categorical speech emotion recognition. The framework learns segment-level emotional cues from utterances that carry only utterance-level (weak) labels, and the paper reports results that outperform existing methods on the CASIA and IEMOCAP datasets.
Contribution
The paper proposes combining multiple instance learning (MIL) with attention mechanisms to improve categorical speech emotion recognition from weakly labeled data.
Findings
Reports accuracy competitive with or superior to existing state-of-the-art approaches on the CASIA and IEMOCAP datasets.
Identifies the salient emotional segments within an utterance.
Abstract
Categorical speech emotion recognition is typically performed as a sequence-to-label problem, i.e., to determine the discrete emotion label of the input utterance as a whole. One of the main challenges in practice is that most of the existing emotion corpora do not give ground truth labels for each segment; instead, we only have labels for whole utterances. To extract segment-level emotional information from such weakly labeled emotion corpora, we propose using multiple instance learning (MIL) to learn segment embeddings in a weakly supervised manner. Also, for a sufficiently long utterance, not all of the segments contain relevant emotional information. In this regard, three attention-based neural network models are then applied to the learned segment embeddings to attend to the most salient part of a speech utterance. Experiments on the CASIA corpus and the IEMOCAP database show better…
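To make the idea of attending over segment embeddings concrete, here is a minimal sketch of attention-weighted pooling in NumPy. It is not one of the paper's three attention models, whose details are not given in this summary. It assumes a generic additive scoring function, and the names `attention_pool`, `w`, and `v`, as well as the random embeddings, are hypothetical placeholders for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_pool(segments, w, v):
    """Pool segment embeddings into one utterance embedding.

    segments: (T, D) array, one row per segment of the utterance.
    w: (D, H) projection and v: (H,) scoring vector (assumed, illustrative
       parameters; in practice they would be learned).
    Returns the (D,) utterance embedding and the (T,) attention weights.
    """
    scores = np.tanh(segments @ w) @ v   # one scalar score per segment
    weights = softmax(scores)            # non-negative, sums to 1
    utterance = weights @ segments       # weighted average of segments
    return utterance, weights


# Toy input: 6 segments with 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
segments = rng.normal(size=(6, 8))
w = rng.normal(size=(8, 4))
v = rng.normal(size=4)

utterance, weights = attention_pool(segments, w, v)
# Uniform case: with a zero scoring vector every segment gets equal weight.
mean_utt, uniform = attention_pool(segments, w, np.zeros(4))
```

Segments with larger weights contribute more to the utterance embedding, which is the sense in which attention can single out the salient part of an utterance. With a zero scoring vector the pooling reduces to a plain average over segments.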
Taxonomy
Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis
