VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

Linan ZHU; Zihao Zhai; Xiao Han; Yuqian Fu; Xiangfan Chen; Xiangjie Kong; Guojiang Shen

arXiv:2605.18547 · cs.AI · May 19, 2026

TL;DR

VISAFF is a novel, tuning-free framework that enhances emotion recognition in conversation by focusing on active speakers' visual cues and integrating multi-modal data efficiently.

Contribution

The paper introduces a speaker-centered visual affective feature learning method that avoids heavy fine-tuning of large vision-language models, improving both efficiency and performance.

Findings

1. VISAFF achieves competitive results on real-world datasets.
2. The framework significantly reduces computational costs.
3. It effectively leverages multi-modal cues for emotion recognition.

Abstract

Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC.…

Code & Models

Repositories

https://anonymous.4open.science/r/speaker-2365