Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
Yuanchao Li, Peter Bell, Catherine Lai

TL;DR
This paper integrates ASR outputs into joint ASR-SER training for Speech Emotion Recognition (SER) and shows that hierarchical co-attention fusion of ASR features brings SER performance close to that obtained with ground-truth transcripts.
Contribution
It introduces a hierarchical co-attention fusion method for incorporating ASR outputs into SER, offering new insight into the ASR-SER relationship and improving emotion recognition accuracy.
Findings
Hierarchical fusion of ASR outputs improves SER accuracy.
Performance on IEMOCAP is close to that achieved with ground-truth transcripts.
Layer-difference analysis of Wav2vec 2.0 improves understanding of the ASR-SER relationship.
Abstract
Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion-labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training of SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various ASR outputs and fusion methods, our experiments show that in joint ASR-SER training, incorporating both ASR hidden and text outputs using a hierarchical co-attention fusion approach improves the SER performance the most. On the IEMOCAP corpus, our approach achieves 63.4% weighted accuracy, which is close to the…
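To make the idea of hierarchical co-attention fusion more concrete, the following is a minimal sketch in plain Python. It is not the authors' architecture: the two-stage ordering (ASR hidden states and text embeddings fused first, then fused with acoustic features), the single-head attention without learned projections, the mean pooling, and all function names are assumptions made for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values.

    Keys and values are the same sequence here (no learned projections),
    which is a simplification for this sketch.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def co_attention(a, b):
    """Co-attention: a attends to b, and b attends to a."""
    return cross_attention(a, b), cross_attention(b, a)


def mean_pool(seq):
    """Average a sequence of vectors over the time axis."""
    n = len(seq)
    return [sum(row[j] for row in seq) / n for j in range(len(seq[0]))]


def hierarchical_fusion(acoustic, asr_hidden, text_emb):
    """Two-stage fusion (assumed ordering, see the lead-in)."""
    # Stage 1: fuse the two ASR outputs with each other.
    hid_att, txt_att = co_attention(asr_hidden, text_emb)
    linguistic = hid_att + txt_att  # concatenate along the time axis
    # Stage 2: fuse the combined ASR representation with acoustic features.
    ac_att, ling_att = co_attention(acoustic, linguistic)
    # Pool each stream and concatenate into one utterance-level vector.
    return mean_pool(ac_att) + mean_pool(ling_att)


def random_sequence(length, dim):
    """Hypothetical stand-in for real feature sequences."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


random.seed(0)
dim = 4
utterance_vec = hierarchical_fusion(
    random_sequence(6, dim),  # acoustic frames
    random_sequence(5, dim),  # ASR hidden states
    random_sequence(3, dim),  # embeddings of the ASR text output
)
print(len(utterance_vec))
```

In a real system the pooled vector would feed an emotion classifier, and the attention layers would carry learned query, key, and value projections trained jointly with the ASR model.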
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
