Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

Qize Yang; Detao Bai; Yi-Xing Peng; Xihan Wei

arXiv:2501.09502·cs.CV·January 17, 2025

Open Access

TL;DR

This paper introduces Omni-Emotion, a multimodal emotion analysis framework that extends video large language models with detailed facial and audio cues. It is supported by two new datasets and reports state-of-the-art results.

Contribution

The paper presents a novel integration of facial encoding models into video MLLMs and introduces new datasets with detailed emotion annotations for improved multimodal emotion understanding.

Findings

01. Achieved state-of-the-art performance in emotion recognition and reasoning.

02. Introduced a self-reviewed dataset of 24,137 coarse-grained samples and a human-reviewed dataset of 3,500 manually annotated samples with detailed emotion annotations.

03. Effectively unified audio and facial cues within a shared feature space.

Abstract

Understanding emotions accurately is essential for fields like human-computer interaction. Because emotions are complex and multi-modal in nature (e.g., they are conveyed through both facial expressions and audio), researchers have turned to multi-modal models rather than single-modality ones to understand human emotions. However, current video multi-modal large language models (MLLMs) have difficulty integrating audio effectively and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets limits the development of multi-modal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios…

Taxonomy

Topics: Face recognition and analysis