A Multi-Modal Approach to Infer Image Affect
Ashok Sundaresan, Sugumar Murugesan, Sean Davis, Karthik Kappaganthu, ZhongYi Jin, Divya Jain, Anurag Maunder

TL;DR
This paper proposes a multi-modal deep learning approach that adds human pose, text-based tagging, and CNN-extracted features to facial and scene features for inferring group affect in images. According to the authors, it is the first work to extract all of these modalities with deep neural networks.
Contribution
It presents a multi-modal framework for image affect analysis in which every modality is extracted with deep neural networks, and evaluates it against baselines.
Findings
Improved accuracy over baseline models
All modalities effectively contribute to affect inference
Insights into modality importance and integration
Abstract
The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state-of-the-art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.
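The combination of modalities described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the modalities are fused, so the example below assumes simple feature concatenation followed by a linear softmax classifier. The modality keys, vector sizes, weights, and the three-class output are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List

# Hypothetical modality keys, following the list in the abstract.
MODALITIES = ["face", "scene", "pose", "text", "cnn"]


def fuse(features: Dict[str, List[float]]) -> List[float]:
    """Concatenate per-modality feature vectors in a fixed order."""
    fused: List[float] = []
    for name in MODALITIES:
        fused.extend(features[name])
    return fused


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(fused: List[float],
            weights: List[List[float]],
            biases: List[float]) -> List[float]:
    """Linear classifier on the fused vector; returns class probabilities."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy inputs (assumed values, not data from the paper).
example = {
    "face": [0.2, 0.7],
    "scene": [0.1],
    "pose": [0.5, 0.3],
    "text": [0.9],
    "cnn": [0.4, 0.6],
}
fused_vector = fuse(example)

# One weight row per hypothetical affect class (three classes assumed).
toy_weights = [[0.1] * len(fused_vector),
               [0.0] * len(fused_vector),
               [-0.1] * len(fused_vector)]
toy_biases = [0.0, 0.0, 0.0]
probabilities = predict(fused_vector, toy_weights, toy_biases)
```

In practice each modality vector would come from its own deep network, and the classifier weights would be learned rather than set by hand. Concatenation is only one possible fusion strategy and is not confirmed as the authors' choice.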
Taxonomy
Topics: Face recognition and analysis · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
