Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound

Dengming Zhang; Weitao You; Jingxiong Li; Weishen Lin; Wenda Shi; Xue Zhao; Heda Zuo; Junxian Wu; Lingyun Sun

arXiv:2511.12077·cs.CV·December 2, 2025

TL;DR

This paper introduces VAEmotionLLM, a two-stage framework that enables a vision-language model to understand the emotion expressed by artworks through sight and sound while requiring only limited audio pretraining.

Contribution

The paper proposes a new approach to teach vision-language models to perceive and interpret emotions across modalities without extensive audio pretraining, using vision-guided audio alignment and a cross-modal emotion adapter.

Findings

01. Achieves state-of-the-art results on ArtEmoBenchmark.

02. Effectively aligns audio and visual modalities with limited pretraining.

03. Enhances emotion understanding through the proposed cross-modal framework.

Abstract

Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized…
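The Stage 1 idea in the abstract, aligning the next-token distributions that the shared LLM produces from the visual pathway and from the new audio pathway, can be illustrated with a small distillation loss. The sketch below is only an assumption about what such an objective might look like: the abstract does not state the divergence used, so the choice of KL divergence, the function names (`softmax`, `kl_divergence`, `alignment_loss`), and the toy logits are all hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def alignment_loss(visual_logits, audio_logits):
    """Mean KL(teacher || student) over token positions.

    visual_logits: per-position logits from the frozen visual pathway (teacher).
    audio_logits:  per-position logits from the trainable audio pathway (student).
    Assumption: KL divergence is used here purely for illustration.
    """
    if len(visual_logits) != len(audio_logits):
        raise ValueError("teacher and student must cover the same positions")
    per_position = [
        kl_divergence(softmax(v), softmax(a))
        for v, a in zip(visual_logits, audio_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy example: two token positions, vocabulary of three tokens (made-up values).
visual_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
audio_logits = [[1.0, 1.0, 0.0], [0.0, 0.2, 1.0]]
loss = alignment_loss(visual_logits, audio_logits)
```

In this sketch the loss is zero when the audio pathway reproduces the visual pathway's distributions exactly and grows as they diverge, so minimizing it with respect to the audio pathway alone would pull the student toward the frozen teacher.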

Taxonomy

Topics: Multimodal Machine Learning Applications · Emotion and Mood Recognition · Generative Adversarial Networks and Image Synthesis