Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC
Guanyu Hu, Dimitrios Kollias, Xinyu Yang

TL;DR
This paper introduces VEGA, a multimodal emotion recognition model that uses CLIP's image encoder to build class-specific visual anchors, improving multimodal alignment and performance on emotion recognition in conversations (MERC).
Contribution
It proposes a new Visual Emotion Guided Anchoring (VEGA) mechanism using CLIP's image encoder to enhance multimodal fusion with psychologically meaningful visual priors.
Findings
VEGA achieves state-of-the-art results on the IEMOCAP and MELD datasets.
It demonstrates improved multimodal alignment and robustness.
It uses a dual-branch architecture with self-distillation.
Abstract
Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion…
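To make the anchoring idea concrete, here is a minimal sketch of class-level visual anchors. It is not the authors' implementation. The abstract does not say how anchors are computed or applied, so several choices below are assumptions: anchors are taken as the L2-normalized mean of exemplar embeddings per emotion class, features are compared to anchors by cosine similarity, and the temperature of 0.07 is illustrative. The embeddings are random stand-ins for what CLIP's image encoder would produce from facial exemplars, and the names `build_anchors`, `anchor_probs`, and `fake_embedding` are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_anchors(exemplars_by_class):
    """Assumption: one anchor per class = normalized mean of its exemplar embeddings."""
    anchors = {}
    for label, embeddings in exemplars_by_class.items():
        dim = len(embeddings[0])
        mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
        anchors[label] = l2_normalize(mean)
    return anchors


def anchor_probs(feature, anchors, temperature=0.07):
    """Softmax over cosine similarities between a feature and each class anchor."""
    f = l2_normalize(feature)
    logits = {
        label: sum(a * b for a, b in zip(f, anchor)) / temperature
        for label, anchor in anchors.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Stand-in data: in practice these vectors would come from an image encoder
# applied to facial exemplars; here each class is a noisy one-hot direction.
rng = random.Random(0)
DIM = 8
centers = {
    "happy": [1.0 if i == 0 else 0.0 for i in range(DIM)],
    "sad": [1.0 if i == 1 else 0.0 for i in range(DIM)],
    "angry": [1.0 if i == 2 else 0.0 for i in range(DIM)],
}


def fake_embedding(center):
    """Hypothetical helper: a class direction plus small Gaussian noise."""
    return [c + rng.gauss(0.0, 0.05) for c in center]


exemplars = {label: [fake_embedding(c) for _ in range(5)] for label, c in centers.items()}
anchors = build_anchors(exemplars)

# A fused utterance feature (also a stand-in) scored against the anchors.
probs = anchor_probs(fake_embedding(centers["sad"]), anchors)
predicted = max(probs, key=probs.get)
```

In this toy setup the similarity scores act as an anchor-based classifier. In a trained model, the same similarities could instead feed an auxiliary loss that pulls unimodal and fused features toward their class anchor; that would be one way to read "guide", not a detail the abstract confirms.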
