Jointly Predicting Emotion, Age, and Country Using Pre-Trained Acoustic Embedding
Bagus Tris Atmaja, Zanjabila, and Akira Sasou

TL;DR
This study demonstrates how pre-trained acoustic embeddings can be used in multitask learning to predict emotion, age, and country from speech, showing benefits over traditional features.
Contribution
It introduces a multitask learning framework using wav2vec 2.0 embeddings for simultaneous prediction of emotion, age, and country, highlighting the effectiveness of pre-trained models.
Findings
Pre-trained acoustic embeddings improve prediction performance over traditional acoustic features.
Multitask learning with shared representations benefits all tasks.
Different acoustic features and normalization methods impact performance.
Abstract
In this paper, we demonstrated the benefit of using a pre-trained model to extract acoustic embeddings for jointly predicting three tasks with multitask learning: emotion, age, and native country. The pre-trained model was based on the wav2vec 2.0 large robust model and trained on a speech emotion corpus. The emotion and age tasks were regression problems, while country prediction was a classification task. A single harmonic mean of the three task metrics was used to evaluate the performance of multitask learning. The classifier was a linear network with two independent layers and shared layers, including the output layers. This study explores multitask learning with different acoustic features (including the acoustic embedding extracted from a model trained on an affective speech dataset), seed numbers, batch sizes, and normalizations for predicting paralinguistic information from speech.
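The single-score evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the three per-task metrics, so the sketch assumes only that each task yields one positive score where higher is better. The function name `multitask_score` and all numeric values are hypothetical and are not results from the paper.

```python
from statistics import harmonic_mean


def multitask_score(emotion_score, age_score, country_score):
    """Combine three per-task scores into a single harmonic mean.

    Assumption: every score is positive and "higher is better".
    An error-style metric would need to be converted to such a score first.
    """
    return harmonic_mean([emotion_score, age_score, country_score])


# Hypothetical scores, chosen only to show how the harmonic mean behaves.
balanced = multitask_score(0.5, 0.5, 0.5)
skewed = multitask_score(0.9, 0.5, 0.1)

print(f"balanced: {balanced:.3f}")
print(f"skewed:   {skewed:.3f}")
```

Both score triples have the same arithmetic mean, but the harmonic mean of the skewed triple is much lower. A model therefore cannot reach a high combined score by doing well on one task while neglecting another, which is a common reason to prefer this kind of aggregate for multitask evaluation.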
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Music and Audio Processing
