Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang, Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, Ya Jiang, Shi Cheng, Jie Zhang, Yuzhe Weng

TL;DR
This paper introduces a multi-structure deep learning framework for multimodal emotion recognition that fuses audio-visual features and jointly decodes discrete and dimensional emotions, placing third on the MER-MULTI leaderboard of MER 2023.
Contribution
It presents a novel hierarchical fusion and joint decoding approach with multi-task learning for improved multimodal emotion recognition.
Findings
Achieves third place on the MER-MULTI leaderboard
Improves emotion classification accuracy
Enhances valence regression performance
Abstract
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for emotion classification and valence regression in the decoding stage. A multi-task loss based on uncertainty is also designed to optimize the whole process. Finally, by combining three different structures on the posterior probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of multimodal emotion recognition challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence…
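The abstract mentions an uncertainty-based multi-task loss and a combination of three structures at the posterior probability level, but does not spell out either formula here. The following is a minimal sketch under stated assumptions, not the authors' implementation: the loss uses a common homoscedastic-uncertainty weighting with learnable log-variances, and the fusion is a plain weighted average of class posteriors. The function names, the parameters `log_var_cls` and `log_var_reg`, and the default equal weights are all hypothetical.

```python
import math


def uncertainty_weighted_loss(cls_loss, reg_loss, log_var_cls, log_var_reg):
    """Combine a classification loss and a regression loss.

    Assumed form (not confirmed by the source), with s = log(sigma^2):
        L = exp(-s_c) * L_cls + 0.5 * s_c
          + 0.5 * exp(-s_r) * L_reg + 0.5 * s_r
    In training, s_c and s_r would be learnable scalars; here they are
    passed in as plain floats.
    """
    cls_term = math.exp(-log_var_cls) * cls_loss + 0.5 * log_var_cls
    reg_term = 0.5 * math.exp(-log_var_reg) * reg_loss + 0.5 * log_var_reg
    return cls_term + reg_term


def fuse_posteriors(posteriors, weights=None):
    """Weighted average of per-structure class posteriors.

    posteriors: list of probability vectors, one per structure.
    weights: optional per-structure weights; equal weights are assumed
    when omitted. The result is renormalized to sum to 1.
    """
    n = len(posteriors)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(posteriors[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, posteriors))
        for k in range(num_classes)
    ]
    total = sum(fused)
    return [v / total for v in fused]


# Example: three hypothetical structures voting on two classes.
fused = fuse_posteriors([[0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
combined = uncertainty_weighted_loss(1.0, 2.0, 0.0, 0.0)
```

With both log-variances at zero the classification loss enters at full weight and the regression loss at half weight; raising a log-variance down-weights its task while the additive term penalizes unbounded growth.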
