$M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild

Yuan-Hang Zhang; Rulin Huang; Jiabei Zeng; Shiguang Shan; Xilin Chen

arXiv:2002.02957·cs.CV·February 10, 2020·6 cites

TL;DR

This paper introduces a multi-modal, multi-task framework that fuses visual and acoustic features for continuous valence-arousal estimation in the wild, and reports that it significantly outperforms the baseline method on the ABAW validation set.

Contribution

The novel $M^3$T framework effectively fuses visual and audio data and leverages task correlations for improved valence-arousal estimation in unconstrained environments.

Findings

01 Significantly outperforms the baseline method on the ABAW validation set

02 Fuses visual features from the video with acoustic features from the audio track

03 Uses multi-task learning to exploit correlations between valence/arousal, emotions, and facial actions

Abstract

This report describes the multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020. In the proposed $M^3$T framework, we fuse visual features from videos with acoustic features from the audio tracks to estimate valence and arousal. The spatio-temporal visual features are extracted with a 3D convolutional network and a bidirectional recurrent neural network. Considering the correlations between valence/arousal, emotions, and facial actions, we also explore mechanisms to benefit from other tasks. We evaluated the $M^3$T framework on the validation set provided by ABAW, where it significantly outperforms the baseline method.
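
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `FusionSketch`, all layer sizes, the use of a GRU as the bidirectional recurrent network, fusion by simple concatenation, the auxiliary emotion-classification head, and the `tanh` output range of [-1, 1] are all assumptions made for the example. It also assumes the audio features are already aligned to one vector per video frame.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical audio-visual, multi-task model (illustration only)."""

    def __init__(self, audio_dim=40, hidden=32, num_emotions=7):
        super().__init__()
        # Visual branch, step 1: a tiny 3D CNN over (channels, time, H, W).
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            # Pool only spatially; None keeps the time axis unchanged.
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Visual branch, step 2: a bidirectional recurrent network over time.
        self.rnn = nn.GRU(16, hidden, batch_first=True, bidirectional=True)
        # Assumed fusion: concatenate visual and acoustic features per frame.
        fused_dim = 2 * hidden + audio_dim
        # Main task: per-frame valence and arousal.
        self.va_head = nn.Linear(fused_dim, 2)
        # Assumed auxiliary task: per-frame emotion category logits.
        self.emotion_head = nn.Linear(fused_dim, num_emotions)

    def forward(self, frames, audio):
        # frames: (batch, 3, time, H, W); audio: (batch, time, audio_dim)
        v = self.cnn3d(frames).flatten(2).transpose(1, 2)  # (batch, time, 16)
        v, _ = self.rnn(v)                                 # (batch, time, 2 * hidden)
        fused = torch.cat([v, audio], dim=-1)
        va = torch.tanh(self.va_head(fused))               # assumed range [-1, 1]
        emotion_logits = self.emotion_head(fused)
        return va, emotion_logits


model = FusionSketch()
frames = torch.randn(2, 3, 8, 32, 32)  # 2 clips of 8 RGB frames at 32x32
audio = torch.randn(2, 8, 40)          # 40-dim acoustic features per frame
va, emotion_logits = model(frames, audio)
print(va.shape, emotion_logits.shape)
```

In a multi-task setup of this kind, the losses of the two heads would typically be combined with a weighting factor during training; how the original submission shares information across tasks is not specified in the abstract.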

Code & Models

Repositories

sailordiary/m3t.pytorch (PyTorch)

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Infant Health and Development