Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition
Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, and Juan Cheng

TL;DR
This paper introduces a novel cross-modal prompt-tuning framework that fuses facial and physiological signals for emotion recognition, enhancing generalization across subjects while preserving pretrained facial representations.
Contribution
It proposes a subject-invariant fusion approach with a decoupled shared-specific adapter, improving cross-subject generalization in video-based emotion recognition.
Findings
Outperforms strong baselines on MAHNOB-HCI and DEAP datasets.
Effectively separates subject-shared and subject-specific features.
Improves recognition accuracy and generalization to unseen subjects.
Abstract
Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to…
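The abstract does not say which transform produces the time-frequency representation, so the following is only a minimal sketch that assumes a plain short-time Fourier transform (STFT) with a Hann window. The function names, the window and hop sizes, the 30 Hz sampling rate, the 0.7 to 4 Hz heart-rate band, and the synthetic sine-wave input are all illustrative assumptions, not details taken from the paper.

```python
import cmath
import math


def stft_magnitude(signal, fs, win_len=128, hop=32):
    """Return (freqs_hz, frames); frames[t][k] is the STFT magnitude at frame t, bin k.

    Hypothetical helper: a naive DFT per frame, adequate for short 1-D traces.
    """
    # Hann window tapers each frame to limit spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
              for n in range(win_len)]
    n_bins = win_len // 2 + 1
    freqs = [k * fs / win_len for k in range(n_bins)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = signal[start:start + win_len]
        mean = sum(seg) / win_len
        # Remove the DC offset, then apply the window.
        seg = [(x - mean) * w for x, w in zip(seg, window)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n, x in enumerate(seg)))
            for k in range(n_bins)
        ])
    return freqs, frames


def dominant_frequency(freqs, frame, lo=0.7, hi=4.0):
    """Peak frequency inside an assumed plausible heart-rate band (Hz)."""
    band = [(mag, f) for f, mag in zip(freqs, frame) if lo <= f <= hi]
    return max(band)[1]


# Synthetic stand-in for an rPPG trace: a 1.2 Hz pulse (72 bpm) sampled at 30 Hz.
fs = 30.0
rppg = [math.sin(2 * math.pi * 1.2 * n / fs) for n in range(300)]
freqs, tfr = stft_magnitude(rppg, fs)
peak_hz = dominant_frequency(freqs, tfr[0])
```

Each row of `tfr` is one time frame and each column one frequency bin, which is the kind of 2-D array a prompt generator could consume. A real pipeline would more likely use an optimized FFT routine and a recorded rPPG signal in place of the synthetic sine.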
