SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Helin Wang; Jiarui Hai; Yen-Ju Lu; Karan Thakkar; Mounya Elhilali; Najim Dehak

arXiv:2409.08425 · eess.AS · January 3, 2025

TL;DR

SoloAudio is a new diffusion-based model for target sound extraction that uses a Transformer on latent features and incorporates language-oriented features, achieving state-of-the-art results and strong generalization to unseen sounds.

Contribution

It introduces a Transformer-based latent diffusion model with language-oriented features for target sound extraction, enhancing generalization and zero-shot capabilities.

Findings

01

Achieves state-of-the-art results on the FSD Kaggle 2018 mixture dataset and on real data from AudioSet.

02

Demonstrates strong zero-shot and few-shot sound extraction performance.

03

Shows improved generalization to out-of-domain and unseen sound events.

Abstract

In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and on real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
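
The abstract does not spell out how the "skip-connected Transformer" is wired. One common reading is that the output of each early block is saved and fused into a matching later block, with the last saved output used first. The sketch below illustrates only that wiring under this assumption. It is not the SoloAudio implementation: `make_toy_block`, `skip_connected_forward`, and the averaging fusion are hypothetical stand-ins, and a real model would use attention blocks, conditioning on CLAP features, and a learned fusion of the skip branch.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a Transformer block: just scales its input."""
    def block(x: Vector) -> Vector:
        return [scale * v for v in x]
    return block


def skip_connected_forward(
    x: Vector,
    in_blocks: List[Block],
    mid_block: Block,
    out_blocks: List[Block],
    trace: Optional[List[Tuple[int, int]]] = None,
) -> Vector:
    """Run early blocks, a middle block, then late blocks with long skips.

    Assumption: early block i feeds late block (n - 1 - i), i.e. the saved
    outputs are consumed in last-in, first-out order.
    """
    if len(in_blocks) != len(out_blocks):
        raise ValueError("need as many early blocks as late blocks")

    saved: List[Tuple[int, Vector]] = []
    for i, block in enumerate(in_blocks):
        x = block(x)
        saved.append((i, x))  # keep the early output for a later skip

    x = mid_block(x)

    for j, block in enumerate(out_blocks):
        i, skip = saved.pop()  # most recently saved output comes first
        # Toy fusion: average the two branches. Assumption for illustration;
        # real models usually concatenate and apply a learned projection.
        x = [0.5 * (a + b) for a, b in zip(x, skip)]
        x = block(x)
        if trace is not None:
            trace.append((i, j))  # record which early block fed which late one
    return x


if __name__ == "__main__":
    pairs: List[Tuple[int, int]] = []
    blocks = [make_toy_block(0.9) for _ in range(5)]
    y = skip_connected_forward(
        [1.0, 2.0], blocks[:2], blocks[2], blocks[3:], trace=pairs
    )
    print(pairs, y)
```

Passing a list as `trace` records which early block fed which late block, which makes the pairing easy to inspect for any depth.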

Code & Models

Repositories

wanghelin1997/soloaudio
jax · Official

Models

westbrook/SoloAudio
model · 13 dl · ♡ 2

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Linear Layer · Multi-Head Attention · Convolution · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Attention Is All You Need