Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model
Mengying Ge, Dongkai Tang, Mingyang Li

TL;DR
This paper presents a novel approach using multimodal large language models to generate open-vocabulary emotion labels from videos, enabling detailed emotion recognition in complex scenes beyond fixed labels.
Contribution
It introduces a framework leveraging MLLMs for open-vocabulary emotion labeling, including data processing, training, and multi-model judgment, advancing emotion recognition in videos.
Findings
Achieved significant advantages in the MER-OV track of the MER2024 challenge
Superior capabilities in complex emotion computation
Effective open-vocabulary emotion recognition
Abstract
Multimodal emotion recognition is a task of great interest. However, traditional datasets are based on fixed labels, so models often focus on the main emotions and ignore detailed emotional changes in complex scenes. This report introduces a solution that uses multimodal large language models (MLLMs) to generate open-vocabulary emotion labels from a video. The solution covers the framework used, data generation and processing, training methods, result generation, and multi-model co-judgment. In the MER-OV (Open-Vocabulary Emotion Recognition) track of the MER2024 challenge, our method achieved significant advantages, demonstrating its superior capabilities in complex emotion computation.
Taxonomy
Topics: Educational Technology and Pedagogy · Sentiment Analysis and Opinion Mining
Methods: Focus
