Socratis: Are large multimodal models emotionally aware?
Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko

TL;DR
Socratis is a benchmark for multimodal emotion understanding that captures diverse human reactions to image-caption pairs and evaluates whether large multimodal models can generate plausible reasons for those reactions.
Contribution
The paper presents Socratis, a dataset of image-caption pairs annotated with multiple emotions and free-form reasons for feeling them, and benchmarks how well multimodal models generate human-like emotional explanations.
Findings
Humans prefer human-written reasons over machine-generated ones.
Current models struggle to generate emotionally accurate explanations.
Existing captioning metrics do not align with human preferences.
Abstract
Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark, where each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 image-caption pairs from 5 widely-read news and image-caption (IC) datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Language, Metaphor, and Cognition
Methods: fail · Attentive Walk-Aggregating Graph Neural Network
