Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
Hsuan-Yu Wang, Pei-Ying Lee, Berlin Chen

TL;DR
This paper demonstrates that aligning timestamps between ASR transcripts and speaker diarization outputs significantly improves speech emotion recognition accuracy in multimodal systems, especially in conversational settings.
Contribution
It introduces a timestamp alignment pipeline and a multimodal fusion approach that enhances emotion recognition by synchronizing textual and audio data.
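To make the alignment idea concrete, here is a minimal sketch, not the authors' pipeline. It assumes the ASR model emits word-level timestamps and the diarization model emits speaker turns. Each word is assigned to the turn it overlaps most in time, and consecutive words from the same speaker are merged into one labeled segment. The names `Word`, `Turn`, `Utterance`, and `align` are hypothetical, as is the choice to skip words that overlap no turn.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds, from the ASR output
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds, from the diarization output
    end: float


@dataclass
class Utterance:
    speaker: str
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(word: Word, turns: List[Turn]) -> Optional[str]:
    """Return the speaker whose turn overlaps the word most, or None."""
    best = max(
        turns,
        key=lambda t: overlap(word.start, word.end, t.start, t.end),
        default=None,
    )
    if best is None or overlap(word.start, word.end, best.start, best.end) <= 0.0:
        return None
    return best.speaker


def align(words: List[Word], turns: List[Turn]) -> List[Utterance]:
    """Merge consecutive same-speaker words into speaker-labeled segments."""
    utterances: List[Utterance] = []
    for w in words:
        speaker = assign_speaker(w, turns)
        if speaker is None:
            continue  # assumption: words overlapping no turn are skipped
        if utterances and utterances[-1].speaker == speaker:
            utterances[-1].end = w.end
            utterances[-1].text += " " + w.text
        else:
            utterances.append(Utterance(speaker, w.start, w.end, w.text))
    return utterances


# Toy inputs standing in for real ASR and diarization outputs.
words = [Word("hello", 0.0, 0.4), Word("there", 0.4, 0.9), Word("hi", 1.0, 1.3)]
turns = [Turn("A", 0.0, 0.95), Turn("B", 0.95, 2.0)]
utterances = align(words, turns)
for u in utterances:
    print(u)
```

Maximum-overlap assignment is only one plausible rule. A real pipeline would also need a policy for overlapping speech and for words that straddle a turn boundary.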
Findings
Timestamp alignment improves SER accuracy on the IEMOCAP dataset.
Multimodal fusion with cross-attention and gating outperforms baseline methods.
Precise synchronization enhances robustness of emotion analysis in conversations.
Abstract
In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming…
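The fusion step described in the abstract can be illustrated with a small NumPy sketch. This is one common form of gated cross-attention under stated assumptions, not the paper's exact architecture, which this summary does not specify. Random arrays stand in for RoBERTa and Wav2Vec embeddings already projected to a shared dimension `d`, the weight matrices are random rather than learned, and the function names `cross_attention` and `gated_fusion` are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(text, audio, w_q, w_k, w_v):
    """Text tokens (queries) attend over audio frames (keys and values)."""
    q = text @ w_q                             # (T_text, d)
    k = audio @ w_k                            # (T_audio, d)
    v = audio @ w_v                            # (T_audio, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_text, T_audio)
    return softmax(scores, axis=-1) @ v        # (T_text, d)


def gated_fusion(text, attended, w_g, b_g):
    """A sigmoid gate mixes each text feature with its attended audio feature."""
    joint = np.concatenate([text, attended], axis=-1)   # (T_text, 2d)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_g + b_g)))   # (T_text, d), in (0, 1)
    return gate * text + (1.0 - gate) * attended


rng = np.random.default_rng(0)
d, t_text, t_audio = 16, 5, 40

# Placeholders for projected text and audio embeddings (assumption).
text_emb = rng.normal(size=(t_text, d))
audio_emb = rng.normal(size=(t_audio, d))

# Random weights stand in for learned parameters (assumption).
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
w_g = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
b_g = np.zeros(d)

attended = cross_attention(text_emb, audio_emb, w_q, w_k, w_v)
fused = gated_fusion(text_emb, attended, w_g, b_g)
print(fused.shape)
```

The gate lets the model decide, per feature, how much to trust the text representation versus the audio context. In a trained system the fused sequence would typically be pooled and passed to an emotion classifier.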
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining
