Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding

Dawei Huang; Qing Li; Chuan Yan; Zebang Cheng; Zihao Han; Yurong Huang; Xiang Li; Bin Li; Xiaohui Wang; Zheng Lian; Zhi-Qi Cheng; Xiaojiang Peng

arXiv:2505.06685 · cs.MM · August 14, 2025

TL;DR

Emotion-Qwen is a unified multimodal framework that improves emotion understanding in videos. It combines a Mixture-of-Experts (MoE) architecture, a structured pre-training pipeline, and a large-scale emotion dataset, and achieves state-of-the-art results on emotion recognition benchmarks.

Contribution

Emotion-Qwen introduces a hybrid MoE architecture (the Hybrid Compressor) and a three-stage pre-training pipeline. Together they improve emotion understanding in multimodal models while preserving general vision-language understanding.

Findings

01. Achieves state-of-the-art performance on emotion recognition benchmarks.

02. Maintains strong performance on general vision-language tasks.

03. Develops the large-scale Video Emotion Reasoning dataset with 40K clips.

Abstract

Accurate emotion understanding in videos necessitates effectively recognizing and interpreting emotional states by integrating visual, textual, auditory, and contextual cues. Although recent Large Multimodal Models (LMMs) have exhibited significant progress in general vision-language (VL) tasks, their performance often deteriorates in emotion-specific scenarios, exhibiting catastrophic forgetting when fine-tuned on emotion-centric tasks. To overcome these limitations, we propose Emotion-Qwen, a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general VL reasoning capabilities. Emotion-Qwen introduces a novel Hybrid Compressor based on a Mixture-of-Experts (MoE) architecture, dynamically routing inputs to optimally balance emotion-specific processing and general multimodal reasoning. We further propose a carefully structured…
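The abstract describes the Hybrid Compressor as an MoE component that dynamically routes inputs between emotion-specific and general processing. The following is a minimal sketch of generic softmax top-k MoE gating in plain Python. It is not the authors' implementation. The function names, the gate weights, and the two toy experts (labelled here as hypothetical "emotion" and "general" branches) are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(
    x: Vector,
    gate_weights: Sequence[Vector],
    experts: Sequence[Expert],
    k: int = 1,
) -> Tuple[Vector, List[int]]:
    """Generic top-k MoE routing (hypothetical, not Emotion-Qwen's code).

    Each expert i gets a gate logit <gate_weights[i], x>. The k experts with
    the highest softmax probability are run, and their outputs are mixed
    using the selected probabilities renormalized to sum to 1.
    Assumes every expert returns a vector with the same length as x.
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    # Indices of the k most probable experts (ties keep original order).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        weight = probs[i] / norm
        out = [o + weight * y_j for o, y_j in zip(out, y)]
    return out, top


if __name__ == "__main__":
    # Hypothetical experts: index 0 stands in for an "emotion" branch,
    # index 1 for a "general" branch. Both are toy functions.
    experts: List[Expert] = [lambda v: [2.0 * a for a in v], lambda v: list(v)]
    gates = [[1.0, 0.0], [0.0, 1.0]]
    out, top = top_k_route([3.0, 1.0], gates, experts, k=1)
    print(out, top)  # [6.0, 2.0] [0]
```

With `k=1` each input is processed by a single expert. Larger `k` blends several experts' outputs, which is one common way an MoE layer can trade specialized against shared processing.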

Code & Models

Repositories

24davidhuang/emotion-qwen
PyTorch · Official