VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Hao Cheng; Zhiwei Zhao; Yichao He; Zhenzhen Hu; Jia Li; Meng Wang; Richang Hong

arXiv:2505.02331·cs.CV·August 5, 2025

TL;DR

VAEmo introduces a two-stage framework that combines self-supervised multimodal representation learning with external knowledge injection to improve audiovisual emotion recognition.

Contribution

The paper proposes a novel, efficient two-stage VAEmo framework that enhances emotion-centric VA representations through knowledge injection and contrastive learning.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.
02. Uses a lightweight, unified model for cross-modal encoding.
03. Demonstrates improved emotion recognition accuracy with external knowledge.

Abstract

Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and…
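
The summary above mentions contrastive learning between the visual and audio streams, but this page does not give the paper's loss formulation. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE objective over paired embeddings. The function names, the temperature value of 0.07, and the toy inputs are assumptions made for this example and are not taken from VAEmo.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, audio, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's loss).

    visual[i] and audio[i] are assumed to come from the same clip, so the
    matching pairs sit on the diagonal of the similarity matrix.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    n = len(v)

    # Cosine similarities between every visual/audio pair, scaled by temperature.
    logits = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the visual-to-audio and audio-to-visual directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))


# Toy embeddings: aligned pairs should score a much lower loss than shuffled ones.
visual_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[2.0, 0.0], [0.0, 3.0]]
shuffled_audio = [[0.0, 3.0], [2.0, 0.0]]

aligned_loss = info_nce(visual_emb, aligned_audio)
shuffled_loss = info_nce(visual_emb, shuffled_audio)
```

In this toy setup the loss is near zero when each visual embedding points the same way as its paired audio embedding, and large when the pairs are shuffled, which is the behavior a contrastive alignment objective is meant to reward.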

Code & Models

Repositories

MSA-LMC/VAEmo (official)
