Training A Small Emotional Vision Language Model for Visual Art Comprehension
Jing Zhang, Liang Zheng, Meng Wang, and Dan Guo

TL;DR
This paper introduces a small, efficient vision-language model for understanding visual art's emotional content, leveraging emotion modeling and feature alignment to outperform existing small models and compete with larger ones.
Contribution
The paper proposes a novel small emotional vision language model (SEVLM) that incorporates emotion features and contrastive learning to enhance visual art comprehension.
Findings
SEVLM outperforms baseline small models in emotion understanding.
The model is competitive with larger models like LLaVA 7B after fine-tuning.
Efficient training on a single RTX 2080 Ti achieves strong performance.
Abstract
This paper develops small vision language models for visual art understanding: given an artwork, the model aims to identify its emotion category and explain this prediction in natural language. While small models are computationally efficient, their capacity is much more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived through a VAD dictionary and a VAD head to align the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional texts than it could with traditional text embeddings alone. On the other hand, we design a contrastive…
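The VAD alignment idea described in the abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written lexicon (`TOY_VAD`, with made-up values) and two hypothetical helpers, `text_to_vad` and `vad_alignment_loss`; none of these names or numbers come from the paper. The paper's VAD head presumably operates on model features during training, whereas this sketch only looks up words in plain text, so it shows the shape of the comparison rather than the authors' implementation.

```python
# Toy VAD lexicon: word -> (valence, arousal, dominance), each in [0, 1].
# ASSUMPTION: these values are invented for illustration, not taken from
# any expert-annotated lexicon or from the paper.
TOY_VAD = {
    "calm": (0.8, 0.2, 0.6),
    "joyful": (0.9, 0.7, 0.7),
    "gloomy": (0.2, 0.3, 0.3),
    "terrifying": (0.1, 0.9, 0.2),
}
NEUTRAL = (0.5, 0.5, 0.5)  # fallback when no word is in the lexicon


def text_to_vad(text, lexicon=TOY_VAD):
    """Average the VAD vectors of all lexicon words found in the text."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return NEUTRAL
    return tuple(sum(v[i] for v in hits) / len(hits) for i in range(3))


def vad_alignment_loss(predicted_text, reference_text):
    """Mean squared error between the VAD vectors of two explanations."""
    p = text_to_vad(predicted_text)
    r = text_to_vad(reference_text)
    return sum((a - b) ** 2 for a, b in zip(p, r)) / 3


if __name__ == "__main__":
    reference = "The bright colors feel joyful."
    print(vad_alignment_loss("A calm, quiet scene.", reference))
    print(vad_alignment_loss("A gloomy, dark scene.", reference))
```

A lower value means the predicted explanation is emotionally closer to the reference in VAD space. In an actual training setup the lookup would need to be replaced by a differentiable prediction of the VAD vector so that the loss can update the model.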
Taxonomy
Topics: Digital Media and Visual Art · Color perception and design
Methods: ALIGN
