Training A Small Emotional Vision Language Model for Visual Art Comprehension

Jing Zhang, Liang Zheng, Meng Wang, and Dan Guo

arXiv:2403.11150·cs.CV·July 11, 2024·2 cites

TL;DR

This paper introduces a small, efficient vision-language model for understanding visual art's emotional content, leveraging emotion modeling and feature alignment to outperform existing small models and compete with larger ones.

Contribution

The paper proposes a novel small emotional vision language model (SEVLM) that incorporates emotion features and contrastive learning to enhance visual art comprehension.

Findings

1. SEVLM outperforms baseline small models in emotion understanding.

2. The model is competitive with larger models like LLaVA 7B after fine-tuning.

3. Efficient training on a single RTX 2080 Ti achieves strong performance.

Abstract

This paper develops small vision language models for understanding visual art: given an artwork, the model aims to identify its emotion category and explain this prediction in natural language. While small models are computationally efficient, their capacity is far more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived through a VAD dictionary, and use a VAD head to align the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional texts than traditional text embeddings alone would. On the other hand, we design a contrastive…
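To make the VAD alignment idea in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy lexicon and its values, the whitespace tokenization, the mean pooling, and the mean-squared-error loss are all assumptions chosen for illustration, and the names `TOY_VAD_LEXICON`, `sentence_vad`, and `vad_alignment_loss` are hypothetical.

```python
from typing import Dict, Tuple

Vad = Tuple[float, float, float]  # (valence, arousal, dominance)

# Hypothetical toy lexicon. The numbers are illustrative placeholders,
# not values from any real expert-annotated VAD dictionary.
TOY_VAD_LEXICON: Dict[str, Vad] = {
    "calm": (0.8, 0.2, 0.6),
    "gloomy": (0.2, 0.4, 0.3),
    "joyful": (0.9, 0.7, 0.7),
}

# Assumed fallback for text with no lexicon words.
NEUTRAL: Vad = (0.5, 0.5, 0.5)


def sentence_vad(text: str, lexicon: Dict[str, Vad] = TOY_VAD_LEXICON) -> Vad:
    """Average the VAD vectors of all lexicon words found in the text.

    Tokenization is a plain whitespace split, so punctuation attached
    to a word prevents a match.
    """
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not hits:
        return NEUTRAL
    n = len(hits)
    return tuple(sum(v[i] for v in hits) / n for i in range(3))


def vad_alignment_loss(predicted: str, reference: str) -> float:
    """Mean squared error between the VAD vectors of two explanations.

    MSE is an assumption; the abstract does not state the loss used.
    """
    p = sentence_vad(predicted)
    r = sentence_vad(reference)
    return sum((a - b) ** 2 for a, b in zip(p, r)) / 3
```

Under these assumptions, a predicted explanation whose emotional words match the reference yields a loss of zero, while one with opposing emotional words (for example "joyful" against "gloomy") yields a positive loss that training could minimize.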

Code & Models

Models

jing5566/SEVLM (Hugging Face model)

Taxonomy

Topics: Digital Media and Visual Art · Color perception and design

Methods: ALIGN