Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction

Roshan Sharma; Tyler Vuong; Mark Lindsey; Hira Dhamyal; Rita Singh; Bhiksha Raj

arXiv:2206.12568 · cs.SD · June 28, 2022

TL;DR

This paper presents a multitask learning approach that uses self-supervised features and learnable spectro-temporal receptive fields (STRFs) to predict age, emotion, and country of origin from vocal bursts, reaching a best test score of 0.412 on the ICML 2022 ExVo-MultiTask track.

Contribution

It combines self-supervised (HuBERT) features, learnable STRFs, and a simple score fusion mechanism for multitask vocal burst analysis, and compares independent task-specific models against joint models to assess how well the tasks and feature sets complement each other.

Findings

01

Score fusion over STRF and HuBERT models, together with robust data preprocessing, gave the best ExVo-MultiTask test score (0.412).

02

Self-supervised features enhanced model robustness.

03

Learnable STRFs contributed to better feature extraction.

Abstract

This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
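The abstract does not state the exact fusion rule, so the following is only a minimal sketch of one common form of score fusion: a weighted average of the output scores of two models. The function name `fuse_scores`, the equal weights, and the example score values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_scores(model_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Late fusion: weighted average of per-model score vectors.

    Assumption: every model emits one score per class (or per output
    dimension) in the same order. This is an illustrative rule, not
    necessarily the one used by the authors.
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_outputs = len(model_scores[0])
    if any(len(scores) != n_outputs for scores in model_scores):
        raise ValueError("all models must emit the same number of scores")
    # Normalise by the weight sum so the fused scores stay on the input scale.
    return [
        sum(w * scores[i] for w, scores in zip(weights, model_scores)) / total
        for i in range(n_outputs)
    ]


# Hypothetical country-classification scores from two models.
strf_scores = [0.2, 0.5, 0.3]    # made-up output of an STRF-based model
hubert_scores = [0.4, 0.4, 0.2]  # made-up output of a HuBERT-based model

fused = fuse_scores([strf_scores, hubert_scores], weights=[0.5, 0.5])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

The same averaging applies to regression-style outputs such as age or emotion intensities. In practice the weights would typically be tuned on a validation set rather than fixed at equal values.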

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
