Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style

Wonjune Kang; Deb Roy

arXiv:2508.11187 · eess.AS · August 18, 2025

TL;DR

This paper introduces the task of expressive speech retrieval: retrieving speech utterances from natural language descriptions of speaking style, based on how something was said rather than what was said.

Contribution

It introduces a joint embedding framework for speech and text to facilitate style-based speech retrieval with free-form text prompts, advancing beyond prior content-based methods.

Findings

01 Achieves strong retrieval performance on multiple datasets

02 Effective cross-modal alignment with encoder architecture choices

03 Prompt augmentation improves generalization to new style descriptions

Abstract

We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles…
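The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that speech and text have already been embedded into the joint latent space and that candidates are ranked by cosine similarity; the abstract does not state the similarity measure, so that choice is an assumption. The vectors, utterance IDs, and the `cosine` and `retrieve` helpers are hypothetical toy stand-ins, not the authors' encoders or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_emb, speech_embs, top_k=2):
    """Rank utterance IDs by similarity to the query embedding (assumed metric: cosine)."""
    scored = [(cosine(query_emb, emb), utt_id) for utt_id, emb in speech_embs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [utt_id for _, utt_id in scored[:top_k]]


# Hypothetical speech embeddings; a real system would get these from a trained speech encoder.
speech_embs = {
    "utt_happy": [0.9, 0.1, 0.0],
    "utt_sad": [0.0, 0.2, 0.9],
    "utt_excited": [0.8, 0.3, 0.1],
}

# Hypothetical text embedding for a style prompt such as "spoken in a cheerful tone".
query = [1.0, 0.2, 0.0]

ranked = retrieve(query, speech_embs, top_k=2)
print(ranked)
```

In this toy setup the query vector lies closest to the "happy" and "excited" utterances, so those are returned first. The quality of a real system depends entirely on how well the two encoders are aligned during training, which is the subject of the paper's analyses.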
