Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Mengying Ge; Dongkai Tang; Mingyang Li

arXiv:2408.11286·cs.CV·August 23, 2024

TL;DR

This paper presents a novel approach using multimodal large language models to generate open-vocabulary emotion labels from videos, enabling detailed emotion recognition in complex scenes beyond fixed labels.

Contribution

The paper introduces a framework that uses MLLMs for open-vocabulary emotion labeling, covering data generation and processing, training, and multi-model co-judgment, to capture emotions in videos that fixed label sets miss.

Findings

01 Achieved significant advantages in the MER-OV track of the MER2024 challenge

02 Showed superior capabilities in complex emotion computation

03 Demonstrated effective open-vocabulary emotion recognition

Abstract

Multimodal emotion recognition is a task of great concern. However, traditional datasets are based on fixed labels, so models often focus on the main emotions and ignore detailed emotional changes in complex scenes. This report introduces a solution that uses multimodal large language model (MLLM) technology to generate open-vocabulary emotion labels from a video. The solution covers the framework used, data generation and processing, training methods, result generation, and multi-model co-judgment. In the MER-OV (Open-Vocabulary Emotion Recognition) track of the MER2024 challenge, our method achieved significant advantages, demonstrating its superior capabilities in complex emotion computation.

Taxonomy

Topics: Educational Technology and Pedagogy · Sentiment Analysis and Opinion Mining