VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

Jinghua Tang; Liyun Zhang; Yu Lu; Dian Ding; Lanqing Yang; YiChao Chen; Minjie Bian; Xiaoshan Li; Guangtao Xue

arXiv:2408.13019 · cs.MM · August 26, 2024

TL;DR

This paper introduces the VCEMO dataset for Chinese voiceprint emotion recognition and proposes a multimodal model that fuses speech, text, and external knowledge, achieving state-of-the-art results.

Contribution

It provides a new high-quality Chinese voiceprint emotion dataset and a multimodal fusion model with contrastive learning for improved emotion recognition.

Findings

01 Significant improvement over SOTA on the VCEMO and IEMOCAP datasets

02 Effective fusion of speech, text, and external knowledge

03 Addresses dataset scarcity for Chinese voiceprint emotion recognition

Abstract

Emotion recognition can make machine responses to user commands more human-like, and voiceprint-based perception systems can be easily integrated into commonly used devices such as smartphones and stereos. Although Chinese has the largest number of speakers, there is a noticeable absence of high-quality corpora for emotion recognition from Chinese voiceprints. This paper introduces the VCEMO dataset to address this deficiency. The dataset is constructed from everyday conversations and comprises over 100 users and 7,747 textual samples. The paper also proposes a multimodal model as a benchmark, which fuses speech, text, and external knowledge using a co-attention structure. The system employs contrastive learning-based regulation to handle the uneven distribution of the dataset and the diversity of emotional expressions. The experiments demonstrate the…

Taxonomy

Topics: Speech Recognition and Synthesis