EmoNets: Multimodal deep learning approaches for emotion recognition in video

Samira Ebrahimi Kahou; Xavier Bouthillier; Pascal Lamblin; Caglar Gulcehre; Vincent Michalski; Kishore Konda; Sébastien Jean; Pierre Froumenty; Yann Dauphin; Nicolas Boulanger-Lewandowski; Raul Chandias Ferrari; Mehdi Mirza; David Warde-Farley; Aaron Courville; Pascal Vincent; Roland Memisevic; Christopher Pal; Yoshua Bengio

arXiv:1503.01800·cs.LG·March 31, 2015·41 cites

Open Access

TL;DR

This paper introduces EmoNets, a multimodal deep learning framework for emotion recognition in video that combines visual, audio, and spatio-temporal features. The approach won the 2013 EmotiW challenge and reached 47.67% accuracy on the 2014 EmotiW dataset.

Contribution

The paper presents a multimodal deep learning approach that trains a specialist model for each modality and fuses their outputs into a single classifier. This approach won the 2013 EmotiW challenge.

Findings

01. Achieved 47.67% accuracy on the 2014 EmotiW dataset.

02. Multimodal fusion outperforms single-modality classifiers.

03. Winning submission in the 2013 EmotiW challenge.

Abstract

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces; a deep belief net, focusing on the representation of the audio stream; a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region; and a relational autoencoder, which addresses…
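
The abstract describes one specialist model per modality whose outputs are combined for label assignment. As a rough illustration of that general idea, the sketch below fuses per-modality class probabilities with a weighted average (a simple form of late fusion). This is not the authors' fusion method, which the text above does not specify. The modality names, the weights, the ordering of the seven emotion labels, and the function names `fuse_predictions` and `predict_emotion` are all assumptions made for the example.

```python
# Hypothetical late-fusion sketch: weighted average of per-modality class
# probabilities. Not the paper's method; names and weights are illustrative.

# Assumed label order for the seven emotion classes.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def fuse_predictions(modality_probs, weights):
    """Return the weighted average of each modality's class probabilities.

    modality_probs: dict mapping a modality name to a list of 7 probabilities.
    weights: dict mapping the same modality names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in modality_probs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * len(EMOTIONS)
    for name, probs in modality_probs.items():
        if len(probs) != len(EMOTIONS):
            raise ValueError(f"{name}: expected {len(EMOTIONS)} probabilities")
        share = weights[name] / total_weight  # normalize weights to sum to 1
        for i, p in enumerate(probs):
            fused[i] += share * p
    return fused


def predict_emotion(modality_probs, weights):
    """Pick the emotion with the highest fused probability."""
    fused = fuse_predictions(modality_probs, weights)
    return EMOTIONS[fused.index(max(fused))]


if __name__ == "__main__":
    # Made-up outputs from two hypothetical specialist models for one clip.
    probs = {
        "face_cnn": [0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1],
        "audio_dbn": [0.3, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1],
    }
    weights = {"face_cnn": 0.7, "audio_dbn": 0.3}
    print(predict_emotion(probs, weights))
```

In a sketch like this, the weights would typically be tuned on a validation set, so that a stronger modality contributes more to the final decision without silencing the others.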

Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Human Pose and Action Recognition