Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis
Vikramjit Mitra, Vasudha Kowtha, Hsiang-Yun Sherry Chien, Erdrin Azemi, Carlos Avendano

TL;DR
This paper explores how representations from pre-trained models can be used to estimate dimensional emotion attributes (activation, valence, and dominance) from speech, and reports improved accuracy and noise robustness through multi-modal fusion and knowledge distillation.
Contribution
It introduces a multi-modal fusion approach for speech emotion estimation and analyzes the robustness of lexical and acoustic representations under noise conditions.
Findings
Multi-modal fusion achieved 100% and 30% relative improvements in concordance correlation coefficient (CCC) for valence estimation.
Lexical representations are more robust to noise than acoustic ones.
Knowledge distillation enhances noise robustness of acoustic models.
Abstract
Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Speech models, such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT), have enabled generating lexical and acoustic representations to benefit speech recognition applications. We investigated the use of pre-trained model representations for estimating dimensional emotions, such as activation, valence, and dominance, from speech. We observed that while valence may rely heavily on lexical representations, activation and dominance rely mostly on acoustic information. In this work, we used multi-modal fusion representations from pre-trained models to generate state-of-the-art speech emotion estimation, and we showed a 100% and 30% relative improvement in concordance correlation coefficient…
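The concordance correlation coefficient (CCC) and the notion of "relative improvement" mentioned above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `ccc` and `relative_improvement` are hypothetical, the formula is the standard CCC definition using population statistics, and the scores in the usage lines are made-up numbers chosen only to show the arithmetic.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Standard definition (population statistics):
        CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = fmean(y_true)
    mean_p = fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Assumption: both inputs are the same constant; CCC is undefined,
        # so this sketch returns 0.0 by convention.
        return 0.0
    return 2 * cov / denom


def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical numbers, for illustration only (not results from the paper).
labels = [1.0, 2.0, 3.0]
print(ccc(labels, [1.0, 2.0, 3.0]))      # perfect agreement
print(ccc(labels, [2.0, 3.0, 4.0]))      # same trend, constant offset
print(relative_improvement(0.25, 0.50))  # CCC doubling is a 100% relative gain
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between predictions and labels, which is why the second call scores below 1 even though the two sequences are perfectly correlated.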
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Dropout · Softmax · Dense Connections · Weight Decay
