Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track

Jiawen Huang, Chenxi Huang, Zhuofan Wen, Hailiang Yao, Shun Chen, Longjiang Yang, Cong Yu, Fengyu Zhang, Ran Liu, and Bin Liu

arXiv:2603.13760 · cs.AI · March 17, 2026

Open Access

TL;DR

This paper presents a multimodal emotion regression framework for the ABAW EMI challenge that combines feature concatenation, multi-objective optimization, and a VAD-inspired acoustic prior, reporting a mean Pearson correlation of 0.479.

Contribution

The paper introduces a systematic multimodal fusion approach for emotion regression that pairs concatenation-based fusion with a VAD-aware acoustic model and multi-objective training.

Findings

01. Under the authors' pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies they tested.

02. The proposed method achieves a mean Pearson correlation of 0.479.

03. Multi-objective optimization improves training stability and performance.
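The mean Pearson correlation in the findings above is, as commonly computed for multi-dimensional regression, the Pearson coefficient of each emotion dimension averaged over the six dimensions. The sketch below illustrates that computation under this assumption; the function name `mean_pearson` and the array layout (samples by dimensions) are hypothetical and are not taken from the paper or the challenge toolkit.

```python
import numpy as np

# The six target dimensions listed in the abstract.
EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]


def mean_pearson(y_true, y_pred):
    """Return (mean correlation, per-dimension correlations).

    Both inputs are arrays of shape (n_samples, n_dims). A dimension
    with zero variance yields NaN, since Pearson's r is undefined there.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.ndim != 2 or y_true.shape != y_pred.shape:
        raise ValueError("expected two arrays of shape (n_samples, n_dims)")

    # Center each column, then apply the definition of Pearson's r.
    t = y_true - y_true.mean(axis=0)
    p = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    per_dim = (t * p).sum(axis=0) / denom
    return float(per_dim.mean()), per_dim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.uniform(0.0, 1.0, size=(100, len(EMOTIONS)))
    noisy = labels + rng.normal(scale=0.3, size=labels.shape)
    score, per_dim = mean_pearson(labels, noisy)
    for name, r in zip(EMOTIONS, per_dim):
        print(f"{name}: {r:.3f}")
    print(f"mean: {score:.3f}")
```

Because Pearson's r is invariant to positive scaling and shifting of the predictions, a model can score well on this metric while still being miscalibrated in absolute terms.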

Abstract

We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. Through systematic multimodal exploration of pretrained high-level features, we found that, under our pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This empirical finding motivated us to design a systematic approach built upon three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based…
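As a rough illustration of principles (i) and (ii), the sketch below concatenates per-sample feature matrices from several modalities and evaluates a weighted sum of two objectives. The specific objectives (mean squared error plus one minus the mean Pearson correlation), the weights `w_mse` and `w_corr`, and all function names are assumptions made for this example; the excerpt does not state which objectives the authors combine, and the VAD-inspired prior is not modeled here.

```python
import numpy as np


def concat_features(*modalities):
    """Feature-level fusion: stack modality features along the feature axis.

    Each input has shape (n_samples, d_i); the output has shape
    (n_samples, sum of d_i).
    """
    arrays = [np.asarray(m, dtype=float) for m in modalities]
    n = arrays[0].shape[0]
    if any(a.ndim != 2 or a.shape[0] != n for a in arrays):
        raise ValueError("all modalities must be 2-D with the same n_samples")
    return np.concatenate(arrays, axis=1)


def pearson_per_dim(y_true, y_pred, eps=1e-8):
    """Pearson correlation for each output column; eps avoids division by zero."""
    t = y_true - y_true.mean(axis=0)
    p = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0)) + eps
    return (t * p).sum(axis=0) / denom


def multi_objective_loss(y_true, y_pred, w_mse=1.0, w_corr=1.0):
    """Weighted sum of a regression term and a correlation term (assumed form)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    corr_loss = 1.0 - pearson_per_dim(y_true, y_pred).mean()
    return float(w_mse * mse + w_corr * corr_loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))    # hypothetical audio features
    visual = rng.normal(size=(8, 32))   # hypothetical visual features
    fused = concat_features(audio, visual)
    print(fused.shape)                  # (8, 48)

    targets = rng.uniform(size=(8, 6))
    preds = targets + rng.normal(scale=0.1, size=targets.shape)
    print(multi_objective_loss(targets, preds))
```

The correlation term ties the training signal to the evaluation metric, while the squared-error term keeps predictions on the right scale; how the two are balanced in the actual system is not specified in this excerpt.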

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Vehicle Noise and Vibration Control