Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Yuanchao Li; Peter Bell; Catherine Lai

arXiv:2406.08353·eess.AS·March 25, 2025

TL;DR

This study evaluates how different levels of ASR errors impact Speech Emotion Recognition performance across multiple datasets and proposes a unified framework to improve robustness and accuracy in real-world scenarios.

Contribution

It benchmarks SER performance with various ASR transcripts and introduces a novel error-robust framework combining error correction and fusion techniques.

Findings

01 ASR errors significantly affect SER accuracy.

02 The proposed framework outperforms existing methods in noisy conditions.

03 Fusion techniques improve robustness against ASR errors.

Abstract

Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes both text-only and bimodal SER with six fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and…
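
Because the benchmark is organized around Word Error Rate, it may help to see how WER is typically computed. The following is a minimal sketch, not the authors' evaluation code: it assumes the standard definition (word-level edit distance divided by the number of reference words), lowercases and splits on whitespace as a simplifying normalization, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; real
    evaluations often also strip punctuation and normalize numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    human = "I am so happy today"
    asr = "I am so happy to day"
    print(f"WER = {word_error_rate(human, asr):.2f}")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error ratio rather than a bounded percentage.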

Code & Models

Repositories

yc-li20/SER-on-WER-and-Fusion (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and Audio Processing