Temporal Multimodal Fusion for Video Emotion Classification in the Wild
Valentin Vielzeuf, Stéphane Pateux, Frédéric Jurie

TL;DR
This paper proposes a novel multimodal and temporal fusion approach for video emotion classification, introducing improved face descriptors and a hierarchical fusion method, achieving competitive results on the Emotion in the Wild challenge.
Contribution
It introduces new face descriptors, a hierarchical fusion method, and a CNN architecture tailored for small datasets in video emotion classification.
Findings
Achieved 58.8% accuracy on the Emotion in the Wild challenge.
Ranked 4th in the 2017 challenge.
Demonstrated the effectiveness of hierarchical multimodal fusion.
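To make the idea of temporal and multimodal score fusion concrete, here is a minimal sketch in Python. It is not the paper's hierarchical method, whose details are not given in this summary. It only illustrates generic score-level (late) fusion: per-frame face scores are averaged over time, then combined with audio scores by a weighted sum. The function names, the three-class scores, and the weights are all hypothetical.

```python
import numpy as np

def temporal_average(frame_scores):
    """Average per-frame class scores over time into one clip-level vector."""
    return np.asarray(frame_scores, dtype=float).mean(axis=0)

def fuse_scores(modality_scores, weights):
    """Weighted sum of per-modality class scores, renormalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize the modality weights
    stacked = np.stack([np.asarray(s, dtype=float) for s in modality_scores])
    fused = (w[:, None] * stacked).sum(axis=0)
    return fused / fused.sum()

# Hypothetical class scores for a 3-class problem (illustration only).
face_frames = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1]]  # one row per video frame
audio_clip = [0.2, 0.7, 0.1]                      # one vector for the whole clip

face_clip = temporal_average(face_frames)
fused = fuse_scores([face_clip, audio_clip], weights=[3.0, 1.0])
predicted = int(np.argmax(fused))  # index of the predicted emotion class
```

In this toy setup the weights are fixed by hand; in practice they would typically be tuned on a validation set, and a hierarchical scheme would additionally combine feature-level information before the score stage.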
Abstract
This paper addresses the question of emotion classification. The task consists in predicting emotion labels (taken among a set of possible labels) best describing the emotions contained in short video clips. Building on a standard framework -- lying in describing videos by audio and visual features used by a supervised classifier to infer the labels -- this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important as the size of the training set is small compared to the difficulty of the problem, making…
