Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs

Haruka Asanuma; Naoko Koide-Majima; Ken Nakamura; Takato Horii; Shinji Nishimoto; Masafumi Oizumi

arXiv:2505.12746 · cs.AI · May 26, 2025

TL;DR

This study evaluates how well multimodal large language models (MLLMs) replicate the complex, high-dimensional emotional responses humans have to videos, finding they capture category-level structures but struggle with individual emotion details.

Contribution

The paper introduces a comparative analysis of human and MLLM-generated emotion structures, highlighting the models' strengths at category-level inference and limitations at the single-item level.

Findings

01. Strong correlation between human and model emotion structures at the overall level.

02. Models effectively infer emotion categories elicited by videos.

03. Limitations exist in accurately capturing detailed emotion nuances at the single-item level.

Abstract

Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Capturing this complexity fully requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) capture these high-dimensional, intricate emotion structures, including capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with estimates generated by models (e.g., Gemini or GPT). We evaluated performance not only at the individual video level but also at the level of emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and…
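The two evaluation levels named in the abstract (per-video agreement, and agreement between emotion structures built from inter-video relationships) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the array layout (videos as rows, emotion ratings as columns), the use of Euclidean distance to build the structure, Pearson correlation as the "simple correlation", the function names, and the synthetic data standing in for real ratings.

```python
import numpy as np


def video_level_correlation(human, model):
    """Pearson correlation between human and model ratings, one value per video (row)."""
    h = human - human.mean(axis=1, keepdims=True)
    m = model - model.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(h, axis=1) * np.linalg.norm(m, axis=1)
    return (h * m).sum(axis=1) / denom


def dissimilarity_matrix(ratings):
    """Pairwise Euclidean distance between videos in emotion-rating space (assumed metric)."""
    diff = ratings[:, None, :] - ratings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def structure_correlation(human, model):
    """Pearson correlation between the upper triangles of the two dissimilarity matrices."""
    iu = np.triu_indices(human.shape[0], k=1)  # each video pair once, no diagonal
    d_h = dissimilarity_matrix(human)[iu]
    d_m = dissimilarity_matrix(model)[iu]
    return np.corrcoef(d_h, d_m)[0, 1]


# Synthetic stand-in data: 20 hypothetical videos rated on 5 hypothetical emotions.
rng = np.random.default_rng(0)
human = rng.random((20, 5))
model = human + 0.1 * rng.normal(size=human.shape)  # noisy copy of the human ratings

per_video = video_level_correlation(human, model)
structure_r = structure_correlation(human, model)
```

The structure-level score compares how videos relate to one another rather than how each video is rated, so it can be high even when individual ratings disagree, for example when the model's ratings are shifted by a constant offset.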

Taxonomy

Topics: Emotion and Mood Recognition · Face Recognition and Perception · Sentiment Analysis and Opinion Mining