EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis
SangEun Lee, Yubeen Lee, Eunil Park

TL;DR
EmoVLM-KD introduces a novel approach that combines instruction-tuned vision-language models with a distilled vision module to improve visual emotion analysis performance efficiently.
Contribution
The paper presents EmoVLM-KD, an instruction-tuned vision-language model augmented with a lightweight module distilled from conventional vision models, to improve emotion prediction.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Avoids the high computational cost of deploying a vision-language model and a vision model simultaneously.
Effectively balances predictions from vision-language and vision models.
Abstract
Visual emotion analysis, which has gained considerable attention in the field of affective computing, aims to predict the dominant emotions conveyed by an image. Despite advancements in visual emotion analysis with the emergence of vision-language models, we observed that instruction-tuned vision-language models and conventional vision models exhibit complementary strengths in visual emotion analysis, as vision-language models excel in certain cases, whereas vision models perform better in others. This finding highlights the need to integrate these capabilities to enhance the performance of visual emotion analysis. To bridge this gap, we propose EmoVLM-KD, an instruction-tuned vision-language model augmented with a lightweight module distilled from conventional vision models. Instead of deploying both models simultaneously, which incurs high computational costs, we transfer the…
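The abstract is cut off before it describes the training objective, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the standard temperature-softened distillation loss (KL divergence between teacher and student distributions) and a hypothetical scalar gate that mixes the vision-language model's class probabilities with those of the distilled module. The function names, the temperature value, and the gate are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the generic distillation objective, scaled by T^2 as is
    conventional. The paper's actual loss is not given in this excerpt.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


def fuse(vlm_probs, module_probs, gate):
    """Convex combination of two class distributions.

    `gate` (hypothetical) is the weight in [0, 1] given to the
    vision-language model; the distilled module receives 1 - gate.
    """
    return [gate * a + (1.0 - gate) * b for a, b in zip(vlm_probs, module_probs)]


if __name__ == "__main__":
    # Toy logits for three emotion classes (made-up numbers).
    teacher_logits = [1.0, 2.0, 3.0]
    student_logits = [0.5, 2.5, 2.0]
    print("distillation loss:", distillation_loss(student_logits, teacher_logits))

    vlm_probs = [0.7, 0.2, 0.1]
    module_probs = softmax(student_logits)
    print("fused prediction:", fuse(vlm_probs, module_probs, gate=0.5))
```

In a sketch like this, the teacher would be a conventional vision model and the student the lightweight module, so only the module and the vision-language model are needed at inference time.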
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition · Multimodal Machine Learning Applications
