TL;DR
VAEmo introduces a two-stage framework that combines self-supervised multimodal representation learning with external knowledge injection to improve audiovisual emotion recognition.
Contribution
The paper proposes VAEmo, an efficient two-stage framework that strengthens emotion-centric visual-audio (VA) representations through external knowledge injection and contrastive learning.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Uses a lightweight, unified model for cross-modal encoding.
Demonstrates improved emotion recognition accuracy with external knowledge.
Abstract
Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and…
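To make the masked-reconstruction idea in Stage 1 concrete, here is a minimal NumPy sketch of a generic masked reconstruction objective over a joint sequence of visual and audio tokens. It is not the paper's architecture: the token counts, feature size, mask ratio, zero-filling of masked tokens, and the single linear map standing in for the unified network are all assumptions chosen for illustration.

```python
import numpy as np


def random_token_mask(num_tokens, mask_ratio, rng):
    """Boolean mask with round(num_tokens * mask_ratio) randomly chosen True entries."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def masked_reconstruction_loss(tokens, reconstruction, mask):
    """Mean squared error computed over the masked tokens only."""
    if not mask.any():
        return 0.0
    diff = reconstruction[mask] - tokens[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)

# Hypothetical inputs: 8 visual tokens and 4 audio tokens, 16 features each.
visual_tokens = rng.normal(size=(8, 16))
audio_tokens = rng.normal(size=(4, 16))

# One joint sequence, so a single shared network sees both modalities.
joint_tokens = np.concatenate([visual_tokens, audio_tokens], axis=0)

# Hide a random subset of tokens (mask ratio is an assumed value).
mask = random_token_mask(len(joint_tokens), mask_ratio=0.75, rng=rng)
corrupted = np.where(mask[:, None], 0.0, joint_tokens)

# Stand-in for the unified network: one untrained linear map (assumption).
weights = rng.normal(scale=0.1, size=(16, 16))
reconstruction = corrupted @ weights

loss = masked_reconstruction_loss(joint_tokens, reconstruction, mask)
print(f"masked tokens: {int(mask.sum())}, reconstruction loss: {loss:.4f}")
```

In a real pre-training loop the linear map would be replaced by a trainable encoder and decoder, and this loss would be minimized by gradient descent; the sketch only shows how the loss is restricted to the hidden tokens.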
