Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Anbin QI, Zhongliang Liu, Xinyong Zhou, Jinba Xiao, Fengrun Zhang, Qi Gan, Ming Tao, Gaozheng Zhang, and Lu Zhang

arXiv:2409.07078 · cs.CV · September 12, 2024

Open Access

TL;DR

This paper introduces EmoVCLIP, a CLIP-based model fine-tuned with vision-language prompt learning, and combines it with modality dropout for more robust fusion and self-training on unlabeled videos. The system reached 90.15% accuracy on the MER2024-SEMI test set.

Contribution

The paper presents EmoVCLIP, a CLIP-based video emotion recognition model fine-tuned with vision-language prompt learning, together with modality dropout for multimodal fusion and a self-training strategy for unlabeled videos. The resulting system ranked 1st in the MER2024-SEMI challenge.

Findings

01. Achieved 90.15% accuracy on the MER2024-SEMI test set.

02. Ranked 1st in the MER2024-SEMI challenge.

03. Demonstrated the effectiveness of modality dropout and prompt learning.

Abstract

In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for multimodal emotion recognition. First, we introduce EmoVCLIP, a model fine-tuned from CLIP using vision-language prompt learning and designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled…
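The abstract mentions modality dropout for robust fusion but does not spell out the mechanics. The following is a minimal sketch of the general technique, not the authors' implementation: during training, the features of whole modalities are randomly zeroed per sample so the fused classifier cannot rely on any single input stream. The function names `modality_dropout` and `fuse`, the `p_drop` parameter, the rule that at least one modality always survives, the concatenation-based fusion, and the modality names in the usage note are all assumptions made for illustration.

```python
import numpy as np


def modality_dropout(features, p_drop=0.3, rng=None, training=True):
    """Randomly zero out entire modalities per sample (illustrative sketch).

    features: dict mapping a modality name to an array of shape (batch, dim).
    p_drop:   assumed probability of dropping each modality independently.
    """
    if not training or p_drop <= 0.0:
        # At inference time every modality is passed through unchanged.
        return dict(features)
    if rng is None:
        rng = np.random.default_rng()

    names = list(features)
    batch = features[names[0]].shape[0]

    # keep[i, j] is True when modality j survives for sample i.
    keep = rng.random((batch, len(names))) >= p_drop

    # Assumption: never drop every modality of a sample. If all were
    # dropped, one randomly chosen modality is restored.
    all_dropped = ~keep.any(axis=1)
    rescue = rng.integers(0, len(names), size=batch)
    keep[all_dropped, rescue[all_dropped]] = True

    # Broadcasting the (batch, 1) mask zeroes the full feature vector.
    return {n: features[n] * keep[:, [j]] for j, n in enumerate(names)}


def fuse(features):
    """Assumed fusion step: concatenate modality features in a fixed order."""
    return np.concatenate([features[n] for n in sorted(features)], axis=1)
```

As a usage example, calling `fuse(modality_dropout({"video": v, "audio": a, "text": t}))` on three `(batch, dim)` arrays yields one fused feature matrix per batch in which some samples have one or two modalities blanked out. Passing `training=False` disables the dropout for evaluation.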

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer · Multi-Head Attention