Leveraging Audio-Only Data for Text-Queried Target Sound Extraction

Kohei Saijo; Janek Ebbers; François G. Germain; Sameer Khurana; Gordon Wichern; Jonathan Le Roux

arXiv:2409.13152 · eess.AS · September 23, 2024

TL;DR

This paper presents a method to leverage large amounts of audio-only data for text-queried target sound extraction by using embedding manipulation techniques, enabling effective training without requiring paired text-audio data.

Contribution

It introduces a novel approach to utilize audio-only data with embedding dropout for training text-queried sound extraction models, reducing dependence on paired datasets.
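As a rough illustration of what embedding dropout on a query vector could look like, the sketch below zeroes each dimension of an audio query embedding at random during training and leaves it untouched at inference. The function name `embedding_dropout`, the dropout rate, the example vector, and the inverted-dropout rescaling are all assumptions made for illustration; the summary above does not state these details.

```python
import random
from typing import List, Sequence


def embedding_dropout(
    embedding: Sequence[float],
    rate: float,
    rng: random.Random,
    training: bool = True,
) -> List[float]:
    """Zero each embedding dimension independently with probability `rate`.

    Assumption: surviving dimensions are rescaled by 1 / (1 - rate)
    (standard "inverted" dropout); the source does not specify the scaling.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # Inference (e.g. with text queries): pass the embedding through unchanged.
        return list(embedding)
    keep = 1.0 - rate
    # A dimension is kept when the random draw is >= rate, i.e. with probability `keep`.
    return [x / keep if rng.random() >= rate else 0.0 for x in embedding]


rng = random.Random(0)                  # fixed seed so the sketch is reproducible
audio_query = [0.5, -0.25, 0.75, 0.1]   # hypothetical audio query embedding
dropped = embedding_dropout(audio_query, rate=0.5, rng=rng)
unchanged = embedding_dropout(audio_query, rate=0.5, rng=rng, training=False)
```

Randomly removing dimensions means the extraction model never sees the full audio embedding during training, which is one plausible way to keep it from relying on details that a text embedding would not carry.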

Findings

01. Audio-only data can match the effectiveness of paired data when using embedding dropout.

02. Embedding manipulation techniques help prevent overfitting to audio queries.

03. The proposed method improves TSE performance without requiring large-scale text-audio pairs.

Abstract

The goal of text-queried target sound extraction (TSE) is to extract from a mixture a sound source specified with a natural-language caption. While it is preferable to have access to large-scale text-audio pairs to address a variety of text prompts, the limited number of available high-quality text-audio pairs hinders the data scaling. To this end, this work explores how to leverage audio-only data without any captions for the text-queried TSE task to potentially scale up the data amount. A straightforward way to do so is to use a joint audio-text embedding model, such as the contrastive language-audio pre-training (CLAP) model, as a query encoder and train a TSE model using audio embeddings obtained from the ground-truth audio. The TSE model can then accept text queries at inference time by switching to the text encoder. While this approach should work if the audio and text embedding…
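The encoder switch described in the abstract can be sketched as follows: during training the query comes from the audio branch applied to the ground-truth target sound, and at inference it comes from the text branch applied to a caption. This is a minimal sketch, not the authors' implementation. `select_query`, `toy_audio_encoder`, and `toy_text_encoder` are hypothetical names, and the toy encoders are trivial stand-ins for the two branches of a joint audio-text embedding model such as CLAP; they do not produce a shared embedding space.

```python
from typing import Callable, List, Optional, Sequence

Embedding = List[float]


def select_query(
    audio_encoder: Callable[[Sequence[float]], Embedding],
    text_encoder: Callable[[str], Embedding],
    training: bool,
    target_audio: Optional[Sequence[float]] = None,
    caption: Optional[str] = None,
) -> Embedding:
    """Return the query embedding that conditions the extraction model."""
    if training:
        if target_audio is None:
            raise ValueError("training needs the ground-truth target audio")
        # Audio-only data: the query is derived from the target itself, no caption needed.
        return audio_encoder(target_audio)
    if caption is None:
        raise ValueError("inference needs a text caption")
    # Inference: switch to the text branch of the joint embedding model.
    return text_encoder(caption)


def toy_audio_encoder(wave: Sequence[float]) -> Embedding:
    # Hypothetical stand-in: mean and peak-to-peak range of the waveform.
    return [sum(wave) / len(wave), max(wave) - min(wave)]


def toy_text_encoder(caption: str) -> Embedding:
    # Hypothetical stand-in: word count and character count of the caption.
    return [float(len(caption.split())), float(len(caption))]


train_query = select_query(
    toy_audio_encoder, toy_text_encoder, training=True,
    target_audio=[0.0, 0.5, -0.5, 1.0],
)
infer_query = select_query(
    toy_audio_encoder, toy_text_encoder, training=False,
    caption="a dog barking",
)
```

With a real joint embedding model, both branches would map into the same vector space, so the extraction model could consume either query without architectural changes.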

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Embedding Dropout · Dropout