Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, and Juan Cheng

arXiv:2605.05694·cs.CV·May 8, 2026

TL;DR

This paper introduces a cross-modal prompt-tuning framework that fuses facial features with rPPG-derived physiological signals for video-based emotion recognition, improving generalization to unseen subjects while preserving pretrained facial representations.

Contribution

The paper proposes a subject-invariant fusion approach with a decoupled shared-specific adapter, improving cross-subject generalization in video-based emotion recognition.

Findings

01 Outperforms strong baselines on the MAHNOB-HCI and DEAP datasets.

02 Effectively separates subject-shared and subject-specific features.

03 Enhances recognition accuracy and generalization ability.

Abstract

Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to…
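
The abstract mentions transforming rPPG waveforms into time-frequency representations but, as excerpted here, does not say which transform is used. As a rough illustration only, the sketch below assumes a short-time Fourier transform (STFT) magnitude cropped to a plausible pulse band. The function name `rppg_to_tfr`, the window length, hop size, band limits, and the synthetic 1.2 Hz test signal are all hypothetical choices for this example, not details from the paper.

```python
import numpy as np


def rppg_to_tfr(signal, fs, win_len=256, hop=32, band=(0.7, 4.0)):
    """Magnitude STFT of a 1-D waveform, cropped to a pulse-rate band.

    Assumption: STFT stands in for the paper's unspecified TFR.
    Returns (freqs, tfr) where tfr has shape (n_freqs, n_frames).
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim != 1 or len(signal) < win_len:
        raise ValueError("signal must be 1-D and at least win_len samples long")

    signal = signal - signal.mean()          # remove the DC offset
    window = np.hanning(win_len)             # taper each frame to limit leakage
    n_frames = 1 + (len(signal) - win_len) // hop

    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop: i * hop + win_len] * window for i in range(n_frames)]
    )

    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, win_len//2 + 1)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)

    # Keep only frequencies in the assumed heart-rate band (in Hz).
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return freqs[keep], spectrum[:, keep].T


# Synthetic stand-in for an rPPG trace: a 1.2 Hz pulse (72 bpm) plus noise,
# sampled at an assumed 30 frames per second for 30 seconds.
fs = 30.0
t = np.arange(0, 30, 1.0 / fs)
rng = np.random.default_rng(0)
trace = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.shape)

freqs, tfr = rppg_to_tfr(trace, fs)
peak_hz = freqs[tfr.mean(axis=1).argmax()]   # dominant frequency over time
print(tfr.shape, round(float(peak_hz), 2))
```

In this toy setting the time-averaged spectrum should peak near the simulated pulse frequency. A 2-D array of this kind is the sort of input from which prompts could be generated, though how the paper does so is not described in the text above.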
