Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

Yuanchao Li; Peter Bell; Catherine Lai

arXiv:2110.15684·eess.AS·November 11, 2022·5 cites

Open Access

TL;DR

This paper explores integrating ASR outputs into joint ASR-SER training for Speech Emotion Recognition (SER) and shows that hierarchical fusion of ASR features brings SER performance close to that obtained with ground-truth transcripts.

Contribution

It introduces a hierarchical co-attention fusion method that combines ASR hidden and text outputs for SER, offering new insight into the ASR-SER relationship and improving emotion recognition accuracy.

Findings

01. Hierarchical fusion of ASR outputs improves SER accuracy.

02. Performance close to that of ground-truth transcripts is achieved on IEMOCAP.

03. Layer-difference analysis of Wav2vec 2.0 improves understanding of the ASR-SER relationship.

Abstract

Alongside acoustic information, linguistic features based on speech transcripts have proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion-labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint ASR-SER training. The relationship between ASR and SER is understudied, and it is unclear which ASR features benefit SER and how. By examining various ASR outputs and fusion methods, our experiments show that in joint ASR-SER training, incorporating both the ASR hidden and text outputs using a hierarchical co-attention fusion approach improves SER performance the most. On the IEMOCAP corpus, our approach achieves 63.4% weighted accuracy, which is close to the…
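The abstract names a hierarchical co-attention fusion of acoustic features with ASR hidden and text outputs but does not spell out the architecture. The following is only a minimal sketch of what a two-stage co-attention fusion could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`cross_attention`, `co_attention`, `hierarchical_fusion`), the shared feature dimension, the absence of learned projections, the order of the two stages, and the mean pooling at the end.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention: each query step attends over context.

    query: (T_q, d), context: (T_c, d) -> returns (T_q, d).
    No learned projections are used (an assumption to keep the sketch short).
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (T_q, T_c)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context

def co_attention(x, y):
    """Attend in both directions and return both attended sequences."""
    return cross_attention(x, y), cross_attention(y, x)

def hierarchical_fusion(acoustic, asr_hidden, asr_text):
    """Hypothetical two-stage fusion; the stage order is an assumption.

    Stage 1 fuses acoustic features with ASR hidden states.
    Stage 2 fuses the stage-1 acoustic-side output with text embeddings.
    """
    a1, h1 = co_attention(acoustic, asr_hidden)
    a2, t2 = co_attention(a1, asr_text)
    # Mean-pool over time and concatenate into one utterance-level vector.
    return np.concatenate([a2.mean(axis=0), h1.mean(axis=0), t2.mean(axis=0)])

# Toy inputs with made-up sequence lengths and a shared dimension d.
rng = np.random.default_rng(0)
d = 8
fused = hierarchical_fusion(
    rng.normal(size=(5, d)),   # acoustic frames
    rng.normal(size=(7, d)),   # ASR hidden states
    rng.normal(size=(3, d)),   # embeddings of the ASR text output
)
print(fused.shape)
```

In a trainable model the pooled vector would typically feed an emotion classifier, and the attention would use learned query, key, and value projections; both are left out here to keep the sketch self-contained.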

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing