Cross-Lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models
Zhichen Han, Tianqi Geng, Hui Feng, Jiahong Yuan, Korin Richmond, Yuanchao Li

TL;DR
This study compares human listeners and self-supervised models on cross-lingual speech emotion recognition. It finds that models can adapt to a target language given appropriate knowledge transfer, and that dialect affects how emotion is perceived.
Contribution
The paper analyzes SSL models against human listeners in cross-lingual SER, covering layer-wise behavior, parameter-efficient fine-tuning strategies, and the effect of dialect.
Findings
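With appropriate knowledge transfer, models can adapt to the target language and reach performance comparable to native speakers.
Dialect significantly affects emotion recognition for listeners without prior linguistic and paralinguistic background.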
Humans and models show different emotion recognition behaviors.
Abstract
Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background.…
Taxonomy
Topics: Speech Recognition and Synthesis
