RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

Haoqin Sun; Jingguang Tian; Jiaming Zhou; Hui Wang; Jiabei He; Shiwan Zhao; Xiangyu Kong; Desheng Hu; Xinkang Xu; Xinhui Hu; Yong Qin

arXiv:2505.19437 · cs.SD · May 27, 2025

TL;DR

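This paper introduces RA-CLAP, a relation-augmented contrastive language-audio pretraining model for emotional speaking style retrieval. Instead of assuming a strict binary match between each caption and audio clip, it uses self-distillation to learn local matching relationships between speech and descriptions, and it outperforms baseline models on retrieval tasks.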

Contribution

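It proposes the emotional speaking style retrieval (ESSR) task, the ESS-CLAP model for learning the relationship between speech and natural language descriptions, and RA-CLAP, a relation-augmented contrastive learning model that captures local speech-description relationships.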

Findings

01. RA-CLAP outperforms baseline models on emotional speaking style retrieval (ESSR) tasks.

02. Self-distillation improves model generalization.

03. The model better captures speech-language relationships in emotional speaking styles.

Abstract

The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), and ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. We further propose relation-augmented CLAP (RA-CLAP) to address the limitation of traditional methods that assume a strict binary relationship between captions and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing generalization ability.…
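
The abstract contrasts strict binary caption-audio matching with soft, self-distilled matching but does not give the loss here. The sketch below is only an illustration of that general idea, not the authors' formulation: it assumes a standard symmetric contrastive loss over a speech-text similarity matrix, and it assumes that soft targets are formed by mixing the one-hot (binary) targets with a teacher's softmax distribution. The names `similarity`, `relation_augmented_targets`, `alpha`, and `temperature`, and the random embeddings standing in for encoder outputs, are all hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis):
    return np.exp(log_softmax(x, axis))

def similarity(speech_emb, text_emb, temperature=0.07):
    # Cosine similarity between every speech/description pair, scaled by a
    # temperature (assumed value; not taken from the paper).
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return s @ t.T / temperature

def contrastive_loss(logits, targets_s2t, targets_t2s):
    # Symmetric cross-entropy: speech-to-text over rows, text-to-speech over
    # columns. With identity targets this is the usual binary-matching loss.
    s2t = -(targets_s2t * log_softmax(logits, axis=1)).sum(axis=1).mean()
    t2s = -(targets_t2s * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (s2t + t2s)

def relation_augmented_targets(teacher_logits, alpha=0.5):
    # Assumption: mix one-hot targets with the teacher's predicted
    # distributions so that partially matching pairs receive nonzero weight.
    hard = np.eye(teacher_logits.shape[0])
    s2t = (1 - alpha) * hard + alpha * softmax(teacher_logits, axis=1)
    t2s = (1 - alpha) * hard + alpha * softmax(teacher_logits.T, axis=1)
    return s2t, t2s

# Toy batch: 4 speech clips and 4 descriptions with 8-dim embeddings.
rng = np.random.default_rng(0)
speech = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))

student_logits = similarity(speech, text)
# Stand-in teacher: a perturbed copy of the student's speech embeddings.
teacher_logits = similarity(speech + 0.1 * rng.normal(size=(4, 8)), text)

hard_loss = contrastive_loss(student_logits, np.eye(4), np.eye(4))
s2t, t2s = relation_augmented_targets(teacher_logits, alpha=0.3)
soft_loss = contrastive_loss(student_logits, s2t, t2s)
```

In this sketch, setting `alpha=0` recovers the binary-matching loss exactly, so `alpha` controls how far the targets move away from the strict one-to-one assumption.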

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis