Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track
Jiawen Huang, Chenxi Huang, Zhuofan Wen, Hailiang Yao, Shun Chen, Longjiang Yang, Cong Yu, Fengyu Zhang, Ran Liu, and Bin Liu

TL;DR
This paper presents a multimodal emotion regression framework that combines feature concatenation, multi-objective optimization, and a VAD-inspired acoustic prior, achieving state-of-the-art results on the ABAW EMI challenge.
Contribution
It introduces a systematic multimodal fusion approach with a VAD-aware acoustic model and multi-objective training, advancing emotion regression methods.
Findings
Feature concatenation outperforms complex fusion strategies.
The proposed method achieves a mean Pearson correlation of 0.479.
Multi-objective optimization improves training stability and performance.
Abstract
We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. Through systematic multimodal exploration of pretrained high-level features, we found that, under our pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This empirical finding motivated us to design a systematic approach built upon three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEmotion and Mood Recognition · Music and Audio Processing · Vehicle Noise and Vibration Control
