Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation
Byeongjin Jung, Chanyeong Park, Sejoon Lim

TL;DR
This paper introduces a novel multimodal framework utilizing Distance-aware Soft Prompt Learning with Gaussian-based soft labels for improved valence-arousal estimation in naturalistic environments, leveraging CLIP and audio features.
Contribution
It proposes a new soft prompt learning method with Gaussian soft labels for continuous VA estimation, integrating multimodal features via hierarchical fusion and temporal modeling.
Findings
Achieves state-of-the-art accuracy on the Aff-Wild2 dataset.
Effectively models fine-grained emotional transitions.
Enhances continuous VA estimation in unconstrained scenarios.
Abstract
Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained vision-language models like CLIP have shown remarkable semantic alignment capabilities, their application to continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between the semantic space and continuous dimensions. Specifically, we partition the VA space into a 3×3 grid, defining nine emotional regions, each associated with a distinct textual description. Rather than using hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground-truth coordinates and the region centers, allowing the model to learn fine-grained…
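The soft-label step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the region centers (a uniform 3×3 grid over an assumed VA range of [-1, 1]), the kernel width `sigma`, the normalization of the labels to sum to 1, and the names `CENTERS` and `gaussian_soft_labels` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical region centers: a uniform 3x3 grid over VA in [-1, 1]^2.
# The paper's actual centers are not given in the abstract.
AXIS = np.array([-2.0 / 3.0, 0.0, 2.0 / 3.0])
CENTERS = np.array([(v, a) for a in AXIS for v in AXIS])  # shape (9, 2)


def gaussian_soft_labels(va, centers=CENTERS, sigma=0.5):
    """Map ground-truth (valence, arousal) pairs to soft labels over regions.

    Each label is a Gaussian kernel of the Euclidean distance between the
    ground-truth point and a region center. Normalizing the labels so they
    sum to 1 is an assumption, as is the default sigma.
    """
    va = np.atleast_2d(np.asarray(va, dtype=float))   # (N, 2)
    diff = va[:, None, :] - centers[None, :, :]       # (N, 9, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)              # squared distances, (N, 9)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))   # Gaussian kernel
    return weights / weights.sum(axis=1, keepdims=True)


labels = gaussian_soft_labels([[0.0, 0.0], [0.5, 0.6]])
```

Under these assumptions, a point at the neutral center of the VA space puts most of its mass on the middle region, while a point between centers spreads its mass across neighboring regions. A smaller `sigma` moves the labels closer to a hard one-hot assignment.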
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing
