Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts

Vincent Karas; Andreas Triantafyllopoulos; Meishu Song; Björn W. Schuller

arXiv:2209.07384 · cs.SD · September 28, 2022

Open Access

TL;DR

This paper introduces a multi-task emotion recognition method for vocal bursts that combines self-supervised audio features, attention networks, and uncertainty loss weighting, and surpasses the challenge baseline by a wide margin on all four A-VB tasks.

Contribution

It proposes a novel multi-task framework combining self-supervised features, attention mechanisms, and uncertainty-based loss weighting for vocal burst emotion recognition.

Findings

01

Outperforms the challenge baseline on all four A-VB tasks

02

Uses a large self-supervised audio model as a shared feature extractor

03

Demonstrates effectiveness of attention networks and uncertainty weighting

Abstract

Vocal bursts play an important role in communicating affect, making them valuable for improving speech emotion recognition. Here, we present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB). We use a large self-supervised audio model as a shared feature extractor and compare multiple architectures built on classifier chains and attention networks, combined with uncertainty loss weighting strategies. Our approach surpasses the challenge baseline by a wide margin on all four tasks.
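The abstract does not spell out which uncertainty loss weighting strategies were compared. As a rough illustration only, the sketch below uses one widely used formulation, in which each task loss L_i is scaled by exp(-s_i) and regularised by s_i, where s_i = log(sigma_i^2) is a per-task log-variance that would normally be learned. The function name, the fixed log-variance values, and the loss numbers are all assumptions for illustration and are not taken from the paper.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with per-task log-variances s_i = log(sigma_i^2).

    Assumed formulation (not confirmed by the source):
        total = sum_i exp(-s_i) * L_i + s_i
    A larger s_i down-weights task i, while the additive s_i term
    penalises making the uncertainty arbitrarily large.
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# Hypothetical loss values for four tasks (illustrative numbers only).
losses = [0.8, 1.2, 0.5, 2.0]

# With all log-variances at zero, every task has weight 1 and no penalty.
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0, 0.0])

# Raising the log-variance of the last task reduces its contribution.
down_weighted = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0, 1.0])
```

In a real training loop the log-variances would be trainable parameters optimised jointly with the network, so the balance between the tasks adapts during training rather than being fixed by hand.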

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing