Temporal aggregation of audio-visual modalities for emotion recognition
Andreea Birhala, Catalin Nicolae Ristea, Anamaria Radoi, Liviu Cristian Dutu

TL;DR
This paper introduces a multimodal fusion technique that combines audio and visual data drawn from a temporal window, using a different temporal offset for each modality. The authors report that it outperforms both existing methods and human accuracy on the CREMA-D dataset.
Contribution
The paper proposes a temporal aggregation method for audio-visual emotion recognition that fuses audio and visual inputs taken from a temporal window at different temporal offsets, which the authors report improves recognition accuracy.
Findings
Outperforms existing methods on the CREMA-D dataset
Achieves higher accuracy than human raters
Demonstrates the effectiveness of temporal window fusion
Abstract
Emotion recognition has a pivotal role in affective computing and in human-computer interaction. The current technological developments lead to increased possibilities of collecting data about the emotional state of a person. In general, human perception regarding the emotion transmitted by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information seems to be the preferred choice in most of the current approaches towards emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature and human accuracy…
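The abstract does not say how the temporal window or the per-modality offsets are chosen, so the following is only a minimal Python sketch of the general idea: cut an audio segment and a video segment out of the same window, each shifted by its own offset, then concatenate simple per-modality features. The function names, the offset values, and the placeholder features (mean and standard deviation for audio, per-dimension averages for video) are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import fmean, pstdev


def select_window(samples, rate, start_s, length_s):
    """Return the items of `samples` covering [start_s, start_s + length_s) seconds."""
    start = int(round(start_s * rate))
    stop = start + int(round(length_s * rate))
    return samples[start:stop]


def fuse_with_offsets(audio, audio_rate, frame_feats, frame_rate,
                      window_start_s, audio_offset_s, video_offset_s, segment_s):
    """Concatenate placeholder audio and video features taken at different offsets.

    `audio` is a list of samples; `frame_feats` is a list of per-frame feature
    vectors. Both layouts are hypothetical stand-ins for real model inputs.
    """
    # Each modality gets its own offset relative to the shared window start.
    audio_seg = select_window(audio, audio_rate,
                              window_start_s + audio_offset_s, segment_s)
    video_seg = select_window(frame_feats, frame_rate,
                              window_start_s + video_offset_s, segment_s)

    # Placeholder features standing in for learned embeddings (assumption).
    audio_feat = [fmean(audio_seg), pstdev(audio_seg)]
    # Average every feature dimension over the selected frames.
    video_feat = [fmean(dim) for dim in zip(*video_seg)]

    # Fusion by simple concatenation of the two feature vectors.
    return audio_feat + video_feat
```

For example, calling `fuse_with_offsets(audio, 16000, frame_feats, 30, 0.5, 0.0, 0.2, 1.0)` would take one second of audio starting at 0.5 s and one second of video starting at 0.7 s. In a real system the placeholder features would be replaced by the outputs of audio and visual networks, and the offsets would be tuned or learned.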
