Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge
Hugo Carneiro, Cornelius Weber, Stefan Wermter

TL;DR
This paper realigns the MELD videos using active speaker detection and automatic speech recognition, so that the speaker's facial expressions can be captured reliably; an emotion recognition model trained on the realigned data outperforms state-of-the-art vision-only models.
Contribution
It introduces MELD-FAIR, a realigned version of MELD with accurate speaker localization, and demonstrates enhanced emotion recognition performance using this data.
Findings
Realigned MELD-FAIR videos match the MELD transcriptions more closely than the original MELD videos do.
Emotion recognition model trained on MELD-FAIR outperforms vision-only state-of-the-art.
Facial cues from localized speakers are more informative for ERC.
Abstract
The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires the localisation of the utterance source. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). By using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions from speakers in 96.92% of the utterances provided in MELD.…
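The realignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes an ASR system has already produced word-level timestamps for a clip, and it simply searches for the word window that best matches the dataset transcription. The function names `normalise` and `realign`, the `(word, start_sec, end_sec)` tuple format, and the window-size slack of two words are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

PUNCT = ".,!?;:\"'"


def normalise(text):
    """Lowercase and strip punctuation so ASR output and transcript are comparable."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def realign(asr_words, transcript, slack=2):
    """Find the ASR word window that best matches a transcribed utterance.

    asr_words: list of (word, start_sec, end_sec) tuples; this format is an
               assumption, not the output of any specific ASR library.
    Returns (start_sec, end_sec, similarity), or (None, None, 0.0) if no
    window matches at all.
    """
    target = normalise(transcript)
    words = [w.strip(PUNCT).lower() for w, _, _ in asr_words]
    best = (None, None, 0.0)
    n = len(target)
    # Try windows slightly shorter and longer than the transcript,
    # since ASR may drop or insert words.
    for size in range(max(1, n - slack), n + slack + 1):
        for i in range(len(words) - size + 1):
            score = SequenceMatcher(None, words[i:i + size], target).ratio()
            if score > best[2]:
                best = (asr_words[i][1], asr_words[i + size - 1][2], score)
    return best


# Hypothetical ASR output for a clip whose original cut starts too early.
asr = [("so", 0.0, 0.2), ("anyway", 0.2, 0.6), ("oh", 1.0, 1.2),
       ("my", 1.2, 1.4), ("god", 1.4, 1.9), ("yeah", 2.5, 2.8)]
start, end, score = realign(asr, "Oh my God!")
```

In a fuller system, the recovered time span would then be passed to an active speaker detector to pick the face track of the person speaking; that step is omitted here because it depends on the specific detection model used.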
Taxonomy
Topics: Speech and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis
