TL;DR
Emotion-Qwen is a unified multimodal framework that enhances emotion understanding in videos by integrating a novel MoE-based architecture, a structured pre-training pipeline, and a large-scale emotion dataset, achieving state-of-the-art results.
Contribution
Emotion-Qwen introduces a hybrid MoE architecture and a three-stage pre-training pipeline to improve emotion and vision understanding in multimodal models.
Findings
Achieves state-of-the-art performance on emotion recognition benchmarks.
Maintains strong performance on general vision-language tasks.
Develops the large-scale Video Emotion Reasoning dataset with 40K clips.
Abstract
Accurate emotion understanding in videos requires recognizing and interpreting emotional states by integrating visual, textual, auditory, and contextual cues. Although recent Large Multimodal Models (LMMs) have made significant progress on general vision-language (VL) tasks, their performance often deteriorates in emotion-specific scenarios, and they exhibit catastrophic forgetting when fine-tuned on emotion-centric tasks. To overcome these limitations, we propose Emotion-Qwen, a unified multimodal framework designed to enable robust emotion understanding while preserving general VL reasoning capabilities. Emotion-Qwen introduces a novel Hybrid Compressor based on a Mixture-of-Experts (MoE) architecture, which dynamically routes inputs to balance emotion-specific processing and general multimodal reasoning. We further propose a carefully structured…
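The abstract does not spell out how the Hybrid Compressor routes inputs, so the following is only a minimal sketch of generic top-k MoE gating, not the authors' implementation. The class name ToyMoE, the two-expert setup (read it as one "emotion" expert and one "general" expert), and the single-linear-layer experts are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoE:
    """Hypothetical mixture of experts with a learned-style linear router.

    Assumption: expert 0 stands in for emotion-specific processing and
    expert 1 for general reasoning; the weights here are random, not trained.
    """

    def __init__(self, dim, num_experts=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight row per expert, producing one score per expert.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a single dim x dim linear map.
        self.experts = [[[rng.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return {expert_index: weight} for the top-k experts, weights summing to 1."""
        probs = softmax(matvec(self.router, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}

    def forward(self, token):
        """Combine the selected experts' outputs, weighted by the router."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token).items():
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


if __name__ == "__main__":
    moe = ToyMoE(dim=4, num_experts=2, top_k=1)
    token = [0.5, -0.2, 0.1, 0.9]
    print("routing weights:", moe.route(token))
    print("output:", moe.forward(token))
```

With top_k=1 each token is handled by a single expert; raising top_k blends several experts, which is one common way such a router can trade off specialized and general processing.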
