Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen

TL;DR
This paper enhances speech emotion recognition by integrating advanced speech models, large language models, and emotional speech synthesis, demonstrating improved performance through various data augmentation techniques on the IEMOCAP dataset.
Contribution
It introduces a novel combination of speech PTM, GPT-4, and Azure TTS for generating synthetic emotional data to improve SER accuracy.
Findings
data2vec shows strong representation ability for SER.
Synthetic emotional speech improves recognition performance.
Data augmentation methods outperform other techniques.
Abstract
In this paper, we explored how to boost speech emotion recognition (SER) with the state-of-the-art speech pre-trained model (PTM), data2vec, text generation technique, GPT-4, and speech synthesis technique, Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models, and we found that data2vec has a good representation ability on the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompt and dataset construction, to obtain the synthetic emotional speech data with high quality. Third, we studied different ways of data augmentation to promote the SER task with synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning.…
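Of the augmentation strategies the abstract names, random mixing is the simplest to illustrate. The sketch below is not the authors' implementation. It assumes that random mixing means adding a randomly drawn subset of synthetic utterances to the real training set and shuffling the result. The function name, the `synth_ratio` parameter, the file names, and the emotion labels are all hypothetical.

```python
import random


def mix_real_and_synthetic(real, synthetic, synth_ratio, seed=0):
    """Return all real items plus a random subset of synthetic items, shuffled.

    Assumption: the number of synthetic items is synth_ratio * len(real),
    capped at the number of synthetic items available.
    """
    if synth_ratio < 0:
        raise ValueError("synth_ratio must be non-negative")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_synth = min(len(synthetic), round(synth_ratio * len(real)))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


# Hypothetical (path, label) pairs standing in for real and TTS-generated data.
real = [(f"real_{i}.wav", "happy") for i in range(4)]
synthetic = [(f"tts_{i}.wav", "sad") for i in range(10)]

train = mix_real_and_synthetic(real, synthetic, synth_ratio=0.5)
```

Varying `synth_ratio` is one way to examine how much synthetic speech helps before it starts to dominate the real data; the ratios used in the paper are not stated in this summary.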
Taxonomy
Topics: Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition
Methods: Attention Is All You Need · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Linear Layer · Multi-Head Attention · Dropout
