Self-supervision and Learnable STRFs for Age, Emotion, and Country Prediction
Roshan Sharma, Tyler Vuong, Mark Lindsey, Hira Dhamyal, Rita Singh, and Bhiksha Raj

TL;DR
This paper introduces a multitask learning approach that combines self-supervised (HuBERT) features with learnable spectro-temporal receptive fields (STRFs) to predict age, emotion, and country of origin from vocal bursts, reaching a best test score of 0.412 on the ExVo-MultiTask track of the 2022 ICML Expressive Vocalizations Challenge.
Contribution
It combines self-supervised features, learnable STRFs, and score fusion for multitask vocal burst analysis, and compares independent task-specific models against joint models to assess how well the three tasks complement one another.
Findings
Score fusion over STRF and HuBERT models, together with robust data preprocessing, gave the best ExVo-MultiTask test score (0.412).
Self-supervised features enhanced model robustness.
Learnable STRFs contributed to better feature extraction.
Abstract
This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track. The method of choice utilized a combination of spectro-temporal modulation and self-supervised features, followed by an encoder-decoder network organized in a multitask paradigm. We evaluate the complementarity between the tasks posed by examining independent task-specific and joint models, and explore the relative strengths of different feature sets. We also introduce a simple score fusion mechanism to leverage the complementarity of different feature sets for this task. We find that robust data preprocessing in conjunction with score fusion over spectro-temporal receptive field and HuBERT models achieved our best ExVo-MultiTask test score of 0.412.
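The abstract does not spell out the score fusion mechanism, so the following is only a minimal sketch of one common form of late fusion: a weighted average of the per-model output scores. The function names `fuse_scores` and `argmax`, the equal default weights, and the example country posteriors are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def fuse_scores(
    model_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Late fusion: weighted average of per-model score vectors."""
    if not model_scores:
        raise ValueError("need scores from at least one model")
    n_models = len(model_scores)
    n_outputs = len(model_scores[0])
    if any(len(scores) != n_outputs for scores in model_scores):
        raise ValueError("all models must emit the same number of scores")
    if weights is None:
        # Assumption: equal weighting when no weights are given.
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight sum so fused scores stay on the input scale.
    return [
        sum(w * scores[i] for w, scores in zip(weights, model_scores)) / total
        for i in range(n_outputs)
    ]


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical country posteriors from two models for a single vocal burst.
strf_country = [0.2, 0.5, 0.3]    # model trained on STRF features
hubert_country = [0.7, 0.2, 0.1]  # model trained on HuBERT features

fused_country = fuse_scores([strf_country, hubert_country])
predicted_country = argmax(fused_country)
```

The same averaging applies unchanged to regression outputs such as an age estimate or emotion intensities. Non-uniform weights would normally be tuned on a validation set; whether the authors did so is not stated in the abstract.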
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Test
