TL;DR
VISAFF is a novel, tuning-free framework that enhances emotion recognition in conversation by focusing on active speakers' visual cues and integrating multi-modal data efficiently.
Contribution
It introduces a speaker-centered visual affective feature learning method that avoids heavy fine-tuning of large vision-language models, improving efficiency and performance.
Findings
VISAFF achieves competitive results on real-world datasets.
By avoiding heavy fine-tuning of large vision-language models, the framework significantly reduces computational costs.
It effectively leverages multi-modal cues for emotion recognition.
Abstract
Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC.…
