Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style
Wonjune Kang, Deb Roy

TL;DR
This paper presents a novel task of expressive speech retrieval using natural language descriptions of speaking styles, enabling retrieval based on how speech was spoken rather than content.
Contribution
It introduces a joint embedding framework for speech and text to facilitate style-based speech retrieval with free-form text prompts, advancing beyond prior content-based methods.
Findings
Achieves strong retrieval performance on multiple datasets encompassing 22 speaking styles
Analyzes how encoder architectures and training criteria affect cross-modal alignment
Prompt augmentation improves generalization to arbitrary text queries
Abstract
We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles…
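To make the retrieval step concrete, the following is a minimal sketch of how a text query could be matched against speech utterances once both have been embedded into a shared latent space. It is not the authors' implementation: the speech and text encoders are not shown, the embeddings are hand-written toy vectors, cosine similarity is an assumed scoring function, and the names `l2_normalize` and `retrieve` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def retrieve(query_emb, speech_embs, top_k=3):
    """Rank speech embeddings by cosine similarity to a text-query embedding.

    Assumption: query_emb and every row of speech_embs already live in the
    same joint space (i.e. they came from trained text and speech encoders).
    Returns a list of (utterance_index, similarity), best match first.
    """
    q = l2_normalize(query_emb)
    scored = []
    for idx, emb in enumerate(speech_embs):
        s = l2_normalize(emb)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, s))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings standing in for encoder outputs (illustrative only).
speech_embs = [
    [0.9, 0.1, 0.0],  # utterance 0
    [0.0, 1.0, 0.1],  # utterance 1
    [0.8, 0.3, 0.1],  # utterance 2
    [0.0, 0.1, 1.0],  # utterance 3
]
# Stand-in for the embedding of a style description such as "spoken angrily".
query_emb = [1.0, 0.2, 0.0]

results = retrieve(query_emb, speech_embs, top_k=2)
for idx, score in results:
    print(f"utterance {idx}: similarity {score:.3f}")
```

In this toy setup the query ranks the utterances whose embeddings point in a similar direction first. In a real system the ranking quality depends entirely on how well the two encoders were aligned during training.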
