AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She; Xianrong Yao; Liqun Chen; Jinghe Yu; Yang Gao; Zhanpeng Jin

arXiv:2604.05900 · cs.CV · April 8, 2026


TL;DR

AICA-Bench is a comprehensive benchmark for evaluating vision-language models on holistic affective image content analysis, covering perception, reasoning, and generation tasks.

Contribution

The paper presents AICA-Bench, a new benchmark with three core tasks, and proposes GAT Prompting, a training-free framework for improving VLMs' affective understanding.

Findings

01

GAT reduces intensity calibration errors.

02

GAT enhances descriptive depth in open-ended tasks.

03

Evaluation of 23 VLMs reveals two major limitations: weak intensity calibration and shallow open-ended descriptions.

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and…
