UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

Chuang Chen; Xiao Sun; Zhi Liu

arXiv:2409.18877 · cs.AI · July 21, 2025

TL;DR

UniEmoX is a novel cross-modal pretraining framework that combines psychological insights with contrastive learning and masked image modeling to improve universal scene emotion perception across diverse scenarios.

Contribution

UniEmoX is described as the first large-scale pretraining method to integrate psychological theories with contrastive learning for emotion analysis across varied visual contexts.

Findings

01. Effective across six benchmark datasets
02. Outperforms existing emotion analysis methods
03. Demonstrates strong generalization to diverse scenarios

Abstract

Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding…
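The abstract's idea of exploiting similarity between paired and unpaired image-text samples can be illustrated with a generic CLIP-style symmetric contrastive loss. The sketch below is not the authors' implementation, and the truncated abstract does not show the exact UniEmoX objective. The function names, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions for illustration. Row k of the image embeddings is treated as paired with row k of the text embeddings, and every other combination is treated as unpaired.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    # Cosine similarity between every image and every text embedding.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row k is column k,
    # i.e. the paired sample; all other columns are unpaired negatives.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: average of image-to-text and text-to-image losses.
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))

# Toy example: aligned pairs should give a lower loss than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[2.0, 0.0], [0.0, 3.0]]
texts_shuffled = [[0.0, 3.0], [2.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(images, texts_aligned)
loss_shuffled = symmetric_contrastive_loss(images, texts_shuffled)
```

In a distillation setting such as the one the abstract describes, the text embeddings would come from a frozen CLIP text encoder and the image embeddings from the model being pretrained; the toy vectors here only stand in for those outputs.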

Code & Models

Repositories

chincharles/u-emo (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Emotion and Mood Recognition · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · Contrastive Learning