CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning
Jiacheng Shi, Yanfu Zhang, Ye Gao

TL;DR
This paper introduces CLEP-DG, a framework for speech emotion recognition that fine-tunes CLAP on emotional speech and then applies acoustic context prompt tuning and cross-modal transfer to improve generalization across acoustic conditions. It reports state-of-the-art results across multiple datasets.
Contribution
The paper proposes CLEP-DG, combining emotion-specific fine-tuning of CLAP with acoustic context prompt tuning and cross-modal transfer to improve domain generalization in SER.
Findings
Outperforms prior CLAP-based methods on five benchmark datasets.
Achieves state-of-the-art performance in both supervised and domain generalization settings.
Effectively models diverse acoustic environments without extra labeled audio.
Abstract
Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a framework that enhances CLAP's robustness in emotion recognition. First, we fine-tune CLAP to obtain CLEP, adapting it on large-scale emotional speech datasets to better encode emotion-relevant features. Then, we introduce Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio. Finally, leveraging cross-modal transferability, we train a classifier on text-derived…
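The abstract does not state the loss used to fine-tune CLAP into CLEP. As a rough illustration only, the sketch below shows the symmetric audio-text contrastive objective commonly used by CLAP-style models, assuming that paired audio and text embeddings are already available as arrays. The function names, the temperature value of 0.07, and the NumPy implementation are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an audio-text similarity matrix.

    Row i of audio_emb and row i of text_emb are assumed to be a matched
    pair; every other row in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = a @ t.T / temperature            # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each audio clip should pick out its own caption.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each caption should pick out its own audio clip.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

In this sketch, a batch whose audio and text embeddings are correctly paired yields a lower loss than the same batch with the pairing shuffled, which is the signal that pulls matched audio and text together in the shared embedding space.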
