Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition

Ruoyu Zhao, Xiantao Jiang, F. Richard Yu, Victor C.M. Leung, Tao Wang, and Shaohu Zhang

arXiv:2501.10408 · eess.AS · January 22, 2025 · 2 cites

TL;DR

This paper introduces HuMP-CAT, a cross-linguistic speech emotion recognition method that combines multiple features with a cross-attention transformer and transfer learning, achieving high accuracy across diverse languages.

Contribution

The study presents a novel multi-feature fusion approach with cross-attention and transfer learning for improved cross-linguistic speech emotion recognition.

Findings

01 HuMP-CAT achieves an average accuracy of 78.75% across seven datasets.

02 It attains 88.69% accuracy on German EMODB and 79.48% on Italian EMOVO.

03 The method outperforms existing approaches in multi-language emotion recognition.

Abstract

Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) remains a challenging research problem because linguistic and acoustic features vary significantly across languages. In this study, we propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to carry knowledge from a source emotional speech dataset over to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT…
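
The fusion step described in the abstract can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' architecture: the feature dimensions (768 for HuBERT, 40 for MFCC, 3 for prosody), the frame counts, the random projection weights, and the choice to let the HuBERT stream query a concatenated MFCC-plus-prosody stream are all assumptions made for illustration, and random arrays stand in for real features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from one feature
    stream, keys and values come from a different stream."""
    q = query_feats @ w_q                      # (T_q, d)
    k = context_feats @ w_k                    # (T_c, d)
    v = context_feats @ w_v                    # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_q, T_c)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16  # assumed shared attention dimension

# Random placeholders standing in for real features (all shapes assumed).
hubert = rng.standard_normal((50, 768))    # 50 frames of HuBERT embeddings
mfcc = rng.standard_normal((120, 40))      # 120 frames of MFCCs
prosody = rng.standard_normal((120, 3))    # 120 frames of prosodic values

# Assumed design: stack the two frame-aligned acoustic streams.
acoustic = np.concatenate([mfcc, prosody], axis=1)   # (120, 43)

# Randomly initialised projections; in practice these would be learned.
w_q = rng.standard_normal((768, d_model)) / np.sqrt(768)
w_k = rng.standard_normal((43, d_model)) / np.sqrt(43)
w_v = rng.standard_normal((43, d_model)) / np.sqrt(43)

fused, weights = cross_attention(hubert, acoustic, w_q, w_k, w_v)

# Mean-pool over time to get one utterance-level vector for a classifier.
utterance_vec = fused.mean(axis=0)
```

Here each HuBERT frame gathers a weighted summary of the acoustic frames, so the two streams do not need the same frame rate. Under the transfer-learning setup the abstract describes, the projections would be trained on the source corpus and then fine-tuned on a small portion of the target data.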

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition