Deep Learning of Segment-Level Feature Representation for Speech Emotion Recognition in Conversations
Jiachen Luo, Huy Phan, Joshua Reiss

TL;DR
This paper introduces a novel conversational speech emotion recognition approach that leverages segment-based audio features and attentive bi-directional GRUs to effectively model contextual and speaker-dependent emotional cues in dialogues.
Contribution
It proposes a new method combining pretrained VGGish features with attentive bi-directional GRUs for improved emotion recognition in conversations.
Findings
Outperforms state-of-the-art methods on the MELD dataset.
Effectively captures contextual and speaker-sensitive emotional information.
Demonstrates robustness in dynamic conversational settings.
Abstract
Accurately detecting emotions in conversation is a necessary yet challenging task due to the complexity of emotions and dynamics in dialogues. The emotional state of a speaker can be influenced by many different factors, such as interlocutor stimulus, dialogue scene, and topic. In this work, we propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions. First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances. Second, an attentive bi-directional gated recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly in a dynamic manner. The experiments conducted on the standard conversational dataset MELD demonstrate the effectiveness of the proposed method when compared against state-of…
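The two steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`split_into_segments`, `attend`), the 16 kHz sample rate, the dropping of a trailing partial segment, and the use of plain dot-product attention are all assumptions made for illustration. The 0.96 s window is the input length used by the publicly released VGGish model. The toy vectors stand in for bi-directional GRU hidden states, which are not computed here.

```python
import math


def split_into_segments(num_samples, sample_rate=16000, segment_seconds=0.96):
    """Return (start, end) sample indices of non-overlapping fixed-length segments.

    Assumption: a trailing partial segment is dropped rather than padded.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    return [(s, s + seg_len) for s in range(0, num_samples - seg_len + 1, seg_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states, mask=None):
    """Dot-product attention over context states (hypothetical stand-in).

    `mask` selects which context utterances take part, e.g. only those from
    the same speaker (intra-speaker) or from other speakers (inter-speaker).
    Returns the weighted context vector and the attention weights.
    """
    if mask is None:
        mask = [True] * len(states)
    kept = [h for h, keep in zip(states, mask) if keep]
    if not kept:
        return [0.0] * len(query), []
    scores = [sum(q * x for q, x in zip(query, h)) for h in kept]
    weights = softmax(scores)
    context = [sum(w * h[d] for w, h in zip(weights, kept)) for d in range(len(query))]
    return context, weights


# Toy dialogue: three earlier utterances, then a target utterance by speaker "A".
speakers = ["A", "B", "A"]
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # placeholders for GRU outputs
target_state, target_speaker = [1.0, 0.0], "A"

intra_mask = [s == target_speaker for s in speakers]
inter_mask = [s != target_speaker for s in speakers]
intra_context, intra_weights = attend(target_state, states, intra_mask)
inter_context, inter_weights = attend(target_state, states, inter_mask)
```

In this sketch the speaker masks are one simple way to separate intra- and inter-speaker context; the paper describes modeling both jointly and dynamically, so the masking scheme should be read as an illustrative simplification.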
Taxonomy
Topics: Emotion and Mood Recognition
