MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

TL;DR
This paper presents MultiEmo-Bench, a new multi-label visual emotion analysis dataset for evaluating multimodal large language models, revealing their current capabilities and limitations in predicting complex emotional responses to images.
Contribution
The paper introduces a comprehensive multi-label benchmark dataset for visual emotion analysis, addressing limitations of previous single-label annotations and enabling more accurate evaluation of MLLMs.
Findings
Recent MLLMs show progress in emotion prediction.
Evaluation on MultiEmo-Bench shows these models still have substantial room for improvement.
LLM-as-a-judge does not consistently enhance performance.
Abstract
This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark…
