Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions

Licai Sun; Xingxun Jiang; Haoyu Chen; Yante Li; Zheng Lian; Biu Liu; Yuan Zong; Wenming Zheng; Jukka M. Leppänen; Guoying Zhao

arXiv:2507.21015 · cs.CV · July 29, 2025

TL;DR

This paper introduces EmoCap100K, a large-scale dataset with rich emotional captions, and EmoCapCLIP, a contrastive learning framework that leverages these captions to improve facial emotion recognition across multiple benchmarks.

Contribution

The paper presents a novel large-scale dataset and a contrastive learning method that effectively utilize semantically rich captions for facial emotion representation learning.

Findings

01 Superior performance on 20+ benchmarks

02 Effective exploitation of multi-level caption information

03 Enhanced generalization in emotion recognition

Abstract

Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial…
