Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition

Tong Shi; Xuri Ge; Joemon M. Jose; Nicolas Pugeault; Paul Henderson

arXiv:2405.16701 · cs.CV · May 28, 2024

TL;DR

This paper introduces DE-III, a novel network for audio-visual emotion recognition that leverages optical flow and attentive feature enhancement to better capture facial details and improve recognition accuracy.

Contribution

The paper proposes a new DE-III network incorporating optical flow and intra- and inter-modal attention modules for enhanced emotion recognition.

Findings

01. Outperforms existing methods on three benchmark datasets
02. Effectively captures facial state changes with optical flow
03. Improves feature discriminability for emotion recognition

Abstract

Capturing complex temporal relationships between video and audio modalities is vital for Audio-Visual Emotion Recognition (AVER). However, existing methods pay little attention to local details, such as facial state changes between video frames, which can reduce the discriminability of features and thus lower recognition accuracy. In this paper, we propose a Detail-Enhanced Intra- and Inter-modal Interaction network (DE-III) for AVER, incorporating several novel aspects. We introduce optical flow information to enrich video representations with texture details that better capture facial state changes. A fusion module integrates the optical flow estimation with the corresponding video frames to enhance the representation of facial texture variations. We also design attentive intra- and inter-modal feature enhancement modules to further improve the richness and discriminability of video and…
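
The abstract does not say how the inter-modal enhancement is computed, so the following is only a rough illustration of the general idea of one modality attending to another. It assumes plain scaled dot-product cross-attention with a residual connection, which is a generic stand-in and not the DE-III module itself. The function names (`cross_attention`, `enhance`) and the toy feature values are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Generic scaled dot-product attention (an assumption, not the paper's
    # design): every query vector attends over all key/value pairs.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def enhance(video, audio):
    # Hypothetical inter-modal enhancement: video features query the audio
    # sequence, and the attended audio context is added back residually.
    attended = cross_attention(video, audio, audio)
    return [[x + a for x, a in zip(v, att)] for v, att in zip(video, attended)]

# Toy inputs: 2 video frames and 3 audio steps, each a 2-dimensional feature.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
enhanced = enhance(video, audio)
```

The output keeps the shape of the video sequence (one enhanced vector per frame), so the same function could be applied in the other direction by swapping the arguments. Intra-modal attention would correspond to passing the same sequence as both arguments.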

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Emotion and Mood Recognition