Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts
Vincent Karas, Andreas Triantafyllopoulos, Meishu Song, Björn W. Schuller

TL;DR
This paper introduces a multi-task emotion recognition method for vocal bursts that combines self-supervised audio features, attention networks, and uncertainty loss weighting, and surpasses the challenge baseline by a wide margin on all four A-VB tasks.
Contribution
It proposes a novel multi-task framework combining self-supervised features, attention mechanisms, and uncertainty-based loss weighting for vocal burst emotion recognition.
Findings
Outperforms the challenge baseline by a wide margin on all four tasks
Uses a large self-supervised audio model as a shared feature extractor
Demonstrates effectiveness of attention networks and uncertainty weighting
Abstract
Vocal bursts play an important role in communicating affect, making them valuable for improving speech emotion recognition. Here, we present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB). We use a large self-supervised audio model as a shared feature extractor and compare multiple architectures built on classifier chains and attention networks, combined with uncertainty loss weighting strategies. Our approach surpasses the challenge baseline by a wide margin on all four tasks.
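As an illustration of what uncertainty loss weighting can look like in a multi-task setting, here is a minimal sketch. It assumes the widely used homoscedastic formulation in which each task i has a learnable log-variance s_i and the combined loss is the sum over tasks of 0.5 * exp(-s_i) * L_i + 0.5 * s_i. The abstract does not say which weighting variants the authors used, so the formula, the function names, and the example loss values are all assumptions made for illustration, not the paper's implementation.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses with learnable log-variances (assumed formulation).

    total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i, with s_i = log(sigma_i^2).
    A large s_i down-weights task i, and the + 0.5 * s_i term penalises
    inflating s_i without bound.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need exactly one log-variance per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


def log_variance_gradients(task_losses, log_variances):
    """Gradient of the combined loss with respect to each s_i.

    d/ds [0.5 * exp(-s) * L + 0.5 * s] = 0.5 - 0.5 * exp(-s) * L,
    which is zero at s = log(L), so noisier tasks settle at larger s.
    """
    return [0.5 - 0.5 * math.exp(-s) * loss
            for loss, s in zip(task_losses, log_variances)]


if __name__ == "__main__":
    # Hypothetical losses for four tasks; the values are made up.
    losses = [2.0, 4.0, 0.5, 1.0]
    s = [0.0, 0.0, 0.0, 0.0]  # common initialisation: all log-variances zero
    print(uncertainty_weighted_loss(losses, s))
    print(log_variance_gradients(losses, s))
```

In a real training loop the log-variances would be trainable parameters updated by the optimizer together with the network weights, so the relative task weights adapt during training instead of being tuned by hand.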
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
