AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin

TL;DR
AICA-Bench introduces a comprehensive benchmark for evaluating vision-language models in holistic affective image content analysis, addressing perception, reasoning, and generation tasks.
Contribution
The paper presents AICA-Bench, a new benchmark with three core tasks and proposes GAT Prompting, a training-free framework to improve VLMs' affective understanding.
Findings
GAT reduces intensity calibration errors.
GAT enhances descriptive depth in open-ended tasks.
Evaluation of 23 VLMs reveals key limitations in affective content analysis.
Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
