CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning

Jiacheng Shi; Yanfu Zhang; Ye Gao

arXiv:2507.04048 · cs.SD · July 8, 2025

TL;DR

This paper introduces CLEP-DG, a framework that improves speech emotion recognition by fine-tuning CLAP on emotional speech, tuning learnable prompts to model acoustic environments, and exploiting cross-modal transfer. It reports state-of-the-art results across multiple datasets.

Contribution

The paper proposes CLEP-DG, combining emotion-specific fine-tuning of CLAP with acoustic context prompt tuning and cross-modal transfer to improve domain generalization in SER.

Findings

01. Outperforms prior CLAP-based methods on five benchmark datasets.

02. Achieves state-of-the-art performance in supervised and domain generalization tasks.

03. Effectively models diverse acoustic environments without extra labeled audio.

Abstract

Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a framework that enhances CLAP's robustness in emotion recognition. First, we fine-tune CLAP to obtain CLEP, adapting it on large-scale emotional speech datasets to better encode emotion-relevant features. Then, we introduce Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio. Finally, leveraging cross-modal transferability, we train a classifier on text-derived…
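The abstract builds on CLAP's contrastive language-audio alignment. As a rough illustration of that objective, the sketch below computes a symmetric contrastive loss over paired audio and text embeddings. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the one-hot toy embeddings are all hypothetical, and the paper's actual fine-tuning loss for CLEP may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Assumption: row i of audio_emb is paired with row i of text_emb.
    # The temperature is an illustrative value, not taken from the paper.
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits)[idx, idx].mean()    # audio -> text
    loss_t2a = -log_softmax(logits.T)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

# Toy embeddings (hypothetical): four mutually orthogonal 8-dimensional vectors.
toy = np.eye(4, 8)
aligned_loss = symmetric_contrastive_loss(toy, toy)         # correct pairing
shuffled_loss = symmetric_contrastive_loss(toy, toy[::-1])  # wrong pairing
print(aligned_loss < shuffled_loss)
```

The loss is low when each audio embedding is closest to its own text embedding and high when the pairing is wrong. That shared embedding space is what makes it plausible to train on text-derived features and apply the result to audio, as the abstract describes.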
