Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Haotian Wang; Yuxuan Xi; Hang Chen; Jun Du; Yan Song; Qing Wang; Hengshun Zhou; Chenxi Wang; Jiefeng Ma; Pengfei Hu; Ya Jiang; Shi Cheng; Jie Zhang; Yuzhe Weng

arXiv:2309.07925·eess.AS·September 18, 2023

TL;DR

This paper introduces a deep learning framework for multimodal emotion recognition that fuses audio and visual features through three attention-based structures and jointly decodes discrete and dimensional emotions, placing third on the MER-MULTI leaderboard of MER 2023.

Contribution

It presents a hierarchical audio-visual fusion approach built on attention-guided feature gathering (AFG), combined with joint decoding of emotion class and valence that is trained with an uncertainty-based multi-task loss.

Findings

01 Achieves third place on the MER-MULTI leaderboard

02 Improves emotion classification accuracy

03 Enhances valence regression performance

Abstract

In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for emotion classification and valence regression in the decoding stage. A multi-task loss based on uncertainty is also designed to optimize the whole process. Finally, by combining three different structures on the posterior probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of multimodal emotion recognition challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence…
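The abstract mentions two steps that a small numeric sketch can make concrete: a multi-task loss "based on uncertainty" and the combination of three structures "on the posterior probability level". The Python below is only an illustration under stated assumptions, not the authors' implementation. It assumes the loss follows one common homoscedastic-uncertainty weighting with a learnable log-variance per task, and that posterior-level combination is a (possibly weighted) average of each structure's class probabilities. The function names and the example numbers are hypothetical.

```python
import math


def uncertainty_weighted_loss(cls_loss, reg_loss, log_var_cls, log_var_reg):
    """Combine a classification and a regression loss with per-task log-variances.

    Assumed form (not confirmed by the abstract):
        exp(-s_c) * L_cls + 0.5 * s_c + 0.5 * exp(-s_r) * L_reg + 0.5 * s_r
    where s_c and s_r are learnable log-variances for each task.
    """
    cls_term = math.exp(-log_var_cls) * cls_loss + 0.5 * log_var_cls
    reg_term = 0.5 * math.exp(-log_var_reg) * reg_loss + 0.5 * log_var_reg
    return cls_term + reg_term


def fuse_posteriors(posteriors, weights=None):
    """Weighted average of class posteriors from several models.

    `posteriors` is a list of probability vectors, one per model.
    Uniform weights are an assumption; the abstract gives no weighting scheme.
    """
    n_models = len(posteriors)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    n_classes = len(posteriors[0])
    return [
        sum(w * p[k] for w, p in zip(weights, posteriors)) / total
        for k in range(n_classes)
    ]


# Hypothetical posteriors from three fusion structures over three emotion classes.
structure_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
fused = fuse_posteriors(structure_outputs)
predicted_class = fused.index(max(fused))  # index of the most probable class

# With both log-variances at 0, the loss reduces to L_cls + 0.5 * L_reg.
loss = uncertainty_weighted_loss(1.0, 2.0, 0.0, 0.0)
```

In a trainable setting the two log-variances would be optimized together with the network weights, so a task with a noisier loss receives a smaller effective weight; the additive log-variance terms keep the weights from collapsing to zero.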
