Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation

Byeongjin Jung; Chanyeong Park; Sejoon Lim

arXiv:2603.13415·cs.CV·March 17, 2026

TL;DR

This paper introduces a multimodal framework that uses Distance-aware Soft Prompt Learning with Gaussian-based soft labels to improve valence-arousal estimation in naturalistic environments, building on CLIP and audio features.

Contribution

It proposes a new soft prompt learning method with Gaussian soft labels for continuous VA estimation, integrating multimodal features via hierarchical fusion and temporal modeling.

Findings

01. Achieves state-of-the-art accuracy on the Aff-Wild2 dataset.

02. Effectively models fine-grained emotional transitions.

03. Enhances continuous VA estimation in unconstrained scenarios.

Abstract

Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained vision-language models like CLIP have shown remarkable semantic alignment capabilities, their application to continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between the semantic space and continuous dimensions. Specifically, we partition the VA space into a 3×3 grid, defining nine emotional regions, each associated with a distinct textual description. Rather than using hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground-truth coordinates and the region centers, allowing the model to learn fine-grained…
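
The soft-label step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the VA range of [-1, 1], the placement of region centers at the midpoints of three equal bins, the kernel width `sigma`, the normalization to a sum of 1, and the names `REGION_CENTERS` and `gaussian_soft_labels` are all assumptions made here for illustration.

```python
import math

# Assumption: valence and arousal each lie in [-1, 1] and are split into
# three equal bins, so the bin midpoints are -2/3, 0, and 2/3.
BIN_CENTERS = (-2.0 / 3.0, 0.0, 2.0 / 3.0)

# Nine (valence, arousal) region centers, ordered row by row:
# arousal varies in the outer loop, valence in the inner loop.
REGION_CENTERS = [(v, a) for a in BIN_CENTERS for v in BIN_CENTERS]


def gaussian_soft_labels(valence, arousal, sigma=0.5):
    """Return nine soft-label weights for one ground-truth VA point.

    Each weight is a Gaussian kernel of the Euclidean distance between the
    point and a region center. Dividing by the total (an assumption, not
    stated in the source) makes the weights sum to 1.
    """
    weights = []
    for center_v, center_a in REGION_CENTERS:
        sq_dist = (valence - center_v) ** 2 + (arousal - center_a) ** 2
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # A point near the high-valence, high-arousal corner puts most of its
    # mass on the last region and spreads the rest over its neighbors.
    for (v, a), w in zip(REGION_CENTERS, gaussian_soft_labels(0.9, 0.9)):
        print(f"center=({v:+.2f}, {a:+.2f})  weight={w:.3f}")
```

Under these assumptions, a smaller `sigma` pushes the labels toward a hard one-hot assignment, while a larger `sigma` spreads weight across neighboring regions, which is what lets a point between two regions contribute to both.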

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing