Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs
Haruka Asanuma, Naoko Koide-Majima, Ken Nakamura, Takato Horii, Shinji Nishimoto, Masafumi Oizumi

TL;DR
This study evaluates how well multimodal large language models (MLLMs) replicate the complex, high-dimensional emotional responses humans have to videos, finding they capture category-level structures but struggle with individual emotion details.
Contribution
The paper introduces a comparative analysis of human and MLLM-generated emotion structures, highlighting the models' strengths at category-level inference and limitations at the single-item level.
Findings
Strong correlation between human and model emotion structures at the overall level.
Models effectively infer emotion categories elicited by videos.
Models are less accurate at the single-item level, where they miss fine-grained emotional nuances.
Abstract
Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Fully capturing this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) capture these high-dimensional, intricate emotion structures, including capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., Gemini or GPT). We evaluated performance not only at the individual video level but also from emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and…
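The two evaluation levels named in the abstract (individual videos versus emotion structures built from inter-video relationships) can be illustrated with a small sketch. This is not the authors' pipeline: the excerpt does not specify their similarity measure, so the code assumes a representational-similarity-style comparison in which each video's "structure" is its correlation distance to every other video. The function names, the synthetic ratings, and the array shapes are all hypothetical.

```python
import numpy as np


def per_video_correlation(human, model):
    """Pearson r between human and model rating profiles, one value per video.

    Both inputs have shape (n_videos, n_emotions).
    """
    h = human - human.mean(axis=1, keepdims=True)
    m = model - model.mean(axis=1, keepdims=True)
    denom = np.sqrt((h ** 2).sum(axis=1) * (m ** 2).sum(axis=1))
    return (h * m).sum(axis=1) / denom


def emotion_structure(ratings):
    """Video-by-video dissimilarity matrix: 1 - Pearson r between rating profiles.

    Assumed definition of "structure"; rows of `ratings` are videos.
    """
    return 1.0 - np.corrcoef(ratings)


def structure_correlation(human, model):
    """Correlate the upper triangles of the two dissimilarity matrices."""
    upper = np.triu_indices(human.shape[0], k=1)
    a = emotion_structure(human)[upper]
    b = emotion_structure(model)[upper]
    return np.corrcoef(a, b)[0, 1]


# Synthetic stand-in data: 6 videos rated on 5 emotion items (not real results).
rng = np.random.default_rng(0)
human = rng.random((6, 5))
model = human + 0.1 * rng.normal(size=human.shape)

video_level = per_video_correlation(human, model)
structure_level = structure_correlation(human, model)
print("per-video r:", np.round(video_level, 2))
print("structure r:", round(float(structure_level), 2))
```

The structure-level score depends only on how videos relate to one another, so a model can score well there while still missing specific items for a given video, which is the kind of gap the findings describe.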
Taxonomy
Topics: Emotion and Mood Recognition · Face Recognition and Perception · Sentiment Analysis and Opinion Mining
