TL;DR
This paper introduces CLIP-AUTT, a method that personalizes fine-grained video emotion recognition by dynamically adapting Action Unit prompts at test time, improving accuracy and robustness.
Contribution
It proposes a novel test-time personalization approach using AU prompts within CLIP, enhancing subject-specific emotion recognition without retraining the model.
Findings
CLIP-AUTT outperforms state-of-the-art methods on three datasets.
The approach effectively adapts to unseen subjects in video emotion recognition.
AU-based prompts improve interpretability and fine-grained recognition.
Abstract
Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods depend either on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts; these prompts are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates…
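To make the idea of AUs as structured text prompts concrete, here is a minimal sketch of how such prompts could be built and scored against a frame embedding in CLIP style (cosine similarity followed by a temperature-scaled softmax). It is an illustration under stated assumptions, not the paper's implementation: the prompt template, the function names (`au_prompt`, `score_frame`), the temperature value, and the toy 3-dimensional embeddings are all hypothetical. In a real pipeline the embeddings would come from CLIP's image and text encoders. The AU names follow the standard Facial Action Coding System labels.

```python
import math

# Standard FACS names for a few Action Units.
AU_NAMES = {
    1: "inner brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",
    15: "lip corner depressor",
}

def au_prompt(au_id):
    """Build a structured text prompt for one AU (hypothetical template)."""
    return f"a face showing action unit {au_id}: {AU_NAMES[au_id]}"

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_frame(image_emb, text_embs, temperature=0.07):
    """Probability of each AU prompt given one frame embedding."""
    sims = [cosine(image_emb, t) for t in text_embs]
    return softmax(sims, temperature)

# Toy example: three AU prompts with placeholder embeddings (assumed values,
# standing in for the outputs of a text encoder and an image encoder).
au_ids = [4, 6, 12]
prompts = [au_prompt(a) for a in au_ids]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.2, 0.9]

probs = score_frame(image_emb, text_embs)
best_au = au_ids[probs.index(max(probs))]
```

Because each prompt names a single muscle activation, the per-prompt probabilities can be read directly as evidence for individual AUs, which is one way the interpretability mentioned in the findings could show up in practice.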
