ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining
Anuj Diwan, Eunsol Choi, David Harwath

TL;DR
ParaSpeechCLAP is a dual-encoder contrastive model that maps speech and text style captions into a shared embedding space, supporting style caption retrieval, speech attribute classification, and reward-guided style-prompted TTS.
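To make "dual-encoder contrastive model" concrete, here is a minimal sketch of a generic CLIP/CLAP-style symmetric contrastive objective over paired speech and caption embeddings. It is an illustration under assumptions, not the authors' confirmed formulation: the function names, the toy embeddings, and the temperature value of 0.07 are all hypothetical, and real embeddings would come from trained speech and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(speech_embs, text_embs):
    """Entry [i][j] is the cosine similarity of speech clip i and caption j."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Average of speech-to-text and text-to-speech cross-entropy losses.

    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    sims = cosine_similarity_matrix(speech_embs, text_embs)
    logits = [[x / temperature for x in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-speech
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: row i of `speech` is paired with row i of `text_matched`.
speech = [[1.0, 0.1], [0.0, 1.0]]
text_matched = [[0.9, 0.0], [0.1, 1.0]]
text_swapped = [text_matched[1], text_matched[0]]  # deliberately mispaired

loss_matched = symmetric_contrastive_loss(speech, text_matched)
loss_swapped = symmetric_contrastive_loss(speech, text_swapped)
```

In this toy batch the correctly paired captions give a lower loss than the swapped ones, which is the behavior a contrastive objective rewards: matched speech and caption embeddings are pulled together and mismatched ones are pushed apart.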
Contribution
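The paper introduces two specialized models (ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational) and a unified model (ParaSpeechCLAP-Combined) for style embedding, and reports improved performance on style caption retrieval, speech attribute classification, and style-prompted TTS, where the model serves as an inference-time reward model.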
Findings
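The specialized models perform better on individual style dimensions.
The unified model excels on compositional style evaluation.
ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training.
The models outperform baselines on most metrics in style tasks.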
Abstract
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS…
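The abstract mentions using the model as an inference-time reward model for style-prompted TTS but does not say how the reward is applied. One plausible reading is best-of-N reranking: synthesize several candidates, embed each, and keep the one whose speech embedding is closest to the caption embedding. The sketch below assumes that reading. The function names and the hard-coded vectors are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank_best_of_n(caption_emb, candidate_embs):
    """Score each candidate against the caption and return (best index, all scores).

    Assumption: the reward is the speech-text cosine similarity in the shared space.
    """
    scores = [cosine(caption_emb, c) for c in candidate_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical embeddings: one style caption and three synthesized TTS candidates.
caption_emb = [0.2, 0.9, 0.1]
candidate_embs = [
    [0.9, 0.1, 0.0],  # candidate 0: points in a different direction
    [0.1, 0.8, 0.2],  # candidate 1: closely aligned with the caption
    [0.5, 0.5, 0.5],  # candidate 2: partially aligned
]

best_index, scores = rerank_best_of_n(caption_emb, candidate_embs)
```

The same similarity score also covers style caption retrieval: rank every caption in a pool against one speech embedding instead of ranking candidates against one caption.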
Code & Models
- ajd12342/paraspeechcaps-intrinsic-train (dataset) · 241 downloads
- ajd12342/paraspeechcaps-situational-train (dataset) · 204 downloads
- ajd12342/paraspeechclap-eval-intrinsic (dataset) · 235 downloads
- ajd12342/paraspeechclap-eval-situational (dataset) · 131 downloads
- ajd12342/paraspeechclap-eval-combined (dataset) · 176 downloads
