Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound
Dengming Zhang, Weitao You, Jingxiong Li, Weishen Lin, Wenda Shi, Xue Zhao, Heda Zuo, Junxian Wu, Lingyun Sun

TL;DR
This paper introduces VAEmotionLLM, a two-stage framework that enables a vision-language model to understand the emotion expressed by artworks through both sight and sound while requiring only limited audio pretraining.
Contribution
The paper proposes a new approach to teach vision-language models to perceive and interpret emotions across modalities without extensive audio pretraining, using vision-guided audio alignment and a cross-modal emotion adapter.
Findings
Achieves state-of-the-art results on ArtEmoBenchmark.
Aligns the audio pathway with the frozen visual pathway using only limited audio pretraining.
Enhances emotion understanding through the proposed cross-modal framework.
Abstract
Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized…
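The Stage 1 idea of aligning next-token distributions between a frozen visual pathway and a new audio pathway can be illustrated with a small distillation-style loss. The sketch below is not the authors' implementation: the abstract excerpt is truncated and does not state which divergence is used, so KL divergence is an assumption here, and the names `softmax`, `kl_divergence`, and `alignment_loss`, the `temperature` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert a list of logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def alignment_loss(visual_logits, audio_logits, temperature=1.0):
    """Mean KL(teacher || student) across token positions.

    visual_logits: per-position next-token logits from the frozen
        visual pathway (treated as the teacher; assumption).
    audio_logits: per-position next-token logits from the trainable
        audio pathway (treated as the student; assumption).
    """
    per_position = []
    for v, a in zip(visual_logits, audio_logits):
        teacher = softmax(v, temperature)
        student = softmax(a, temperature)
        per_position.append(kl_divergence(teacher, student))
    return sum(per_position) / len(per_position)


# Toy example: two token positions over a three-word vocabulary.
visual = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
audio = [[0.0, 2.0, -1.0], [0.5, 1.5, 0.0]]
loss = alignment_loss(visual, audio)
```

In this toy case the second position already matches, so only the first position contributes to the loss. In a real training loop the loss would be minimized with respect to the audio pathway's parameters while the visual pathway stays frozen.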
Taxonomy
Topics: Multimodal Machine Learning Applications · Emotion and Mood Recognition · Generative Adversarial Networks and Image Synthesis
