From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition

Yu Liu; Leyuan Qu; Hanlei Shi; Di Gao; Yuhua Zheng; Taihao Li

arXiv:2507.11892·cs.CV·July 17, 2025

Open Access

TL;DR

This paper introduces GRACE, a novel cross-modal alignment framework that enhances dynamic emotion recognition by refining semantic descriptions and filtering irrelevant facial motions, achieving state-of-the-art results.

Contribution

The paper presents a new method combining motion modeling, semantic text refinement, and token-level alignment to improve emotion recognition accuracy.

Findings

01 Significant performance improvements on benchmark datasets.

02 Effective filtering of irrelevant facial dynamics.

03 State-of-the-art accuracy in challenging emotion recognition scenarios.

Abstract

Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), a framework that integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware…
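As a rough illustration of what "token-level cross-modal alignment" can mean in general, the sketch below scores each text token against each visual token by cosine similarity and turns the scores into attention weights with a softmax. This is a generic toy example, not the GRACE implementation: the abstract does not specify the alignment mechanism, and the function names (`cosine`, `softmax`, `align_tokens`), the `temperature` parameter, and the embedding values are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def align_tokens(text_tokens, visual_tokens, temperature=0.1):
    """For each text token, return attention weights over the visual tokens."""
    return [
        softmax([cosine(t, v) for v in visual_tokens], temperature)
        for t in text_tokens
    ]


# Toy embeddings (hypothetical values, for illustration only).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

weights = align_tokens(text_tokens, visual_tokens)

# Index of the visual token each text token attends to most strongly.
best = [max(range(len(w)), key=w.__getitem__) for w in weights]
```

In this toy setup each row of `weights` is a probability distribution over visual tokens, so a low weight can be read as a visual region that the corresponding text token treats as irrelevant. A real system would learn the embeddings and likely use a different scoring function.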

Taxonomy

Topics: Emotion and Mood Recognition · Color perception and design