Unsupervised Deep Representations for Learning Audience Facial Behaviors
Suman Saha, Rajitha Navarathna, Leonhard Helminger, Romann Weber

TL;DR
This paper introduces an unsupervised deep learning method that jointly trains a variational auto-encoder (VAE) and a generative adversarial network (GAN) to analyze audience facial behavior, capturing engagement (smiling, laughing) and disengagement (yawning) signals from video footage without labeled data.
Contribution
The paper presents an unsupervised approach that jointly trains a VAE and a GAN to learn facial behavior representations from unlabeled footage of audiences watching feature-length movies.
Findings
The learned latent representation encodes audience engagement signals such as smiling and laughing.
It also encodes disengagement cues such as yawning.
Provides a proof of concept for annotating complex multimedia data without labels.
Abstract
In this paper, we present an unsupervised learning approach for analyzing facial behavior based on a deep generative model combined with a convolutional neural network (CNN). We jointly train a variational auto-encoder (VAE) and a generative adversarial network (GAN) to learn a powerful latent representation from footage of audiences viewing feature-length movies. We show that the learned latent representation successfully encodes meaningful signatures of behaviors related to audience engagement (smiling & laughing) and disengagement (yawning). Our results provide a proof of concept for a more general methodology for annotating hard-to-label multimedia data featuring sparse examples of signals of interest.
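The abstract states that a VAE and a GAN are trained jointly but does not give the objective. The sketch below shows one common way such a joint objective is assembled: a reconstruction term, a KL term that pulls the latent code toward a standard normal prior, and adversarial terms from a discriminator. It is not the authors' formulation. The function names, the mean-squared reconstruction error, the binary cross-entropy adversarial loss, and the `adv_weight` parameter are all assumptions made for illustration, and the networks themselves are replaced by plain arrays of their outputs.

```python
import numpy as np


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    per_sample = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_sample))


def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))


def vae_gan_losses(x, x_rec, mu, logvar, d_real, d_fake, adv_weight=1.0):
    """Combine VAE and GAN terms (hypothetical weighting, not taken from the paper).

    x, x_rec : input frames and their reconstructions, same shape
    mu, logvar : encoder outputs, shape (batch, latent_dim)
    d_real, d_fake : discriminator probabilities on real and reconstructed frames
    """
    reconstruction = float(np.mean((x - x_rec) ** 2))
    kl = kl_to_standard_normal(mu, logvar)
    # Discriminator objective: label real frames 1 and reconstructions 0.
    discriminator = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
    # Encoder/decoder objective: reconstruct well, stay near the prior,
    # and make reconstructions look real to the discriminator.
    fool = bce(d_fake, np.ones_like(d_fake))
    encoder_decoder = reconstruction + kl + adv_weight * fool
    return {
        "reconstruction": reconstruction,
        "kl": kl,
        "discriminator": discriminator,
        "encoder_decoder": encoder_decoder,
    }
```

In a training loop built on this sketch, the discriminator would be updated on the `discriminator` value and the encoder and decoder on the `encoder_decoder` value, alternating between the two. After training, the encoder mean `mu` for each face crop would serve as the latent representation to inspect for behaviors such as smiling or yawning.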
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
