SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

TL;DR
SoloAudio is a new diffusion-based model for target sound extraction that uses a Transformer on latent features and incorporates language-oriented features, achieving state-of-the-art results and strong generalization to unseen sounds.
Contribution
It introduces a Transformer-based latent diffusion model with language-oriented features for target sound extraction, enhancing generalization and zero-shot capabilities.
Findings
Achieves state-of-the-art results on the FSD Kaggle 2018 mixture dataset and on real data from AudioSet.
Demonstrates strong zero-shot and few-shot sound extraction performance.
Shows improved generalization to out-of-domain and unseen sound events.
Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
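The abstract describes a skip-connected Transformer operating on latent features and conditioned on a target-sound embedding. The sketch below illustrates only the wiring such a backbone might use; it is not the authors' implementation. Everything in it is an assumption made for illustration: the names `make_block` and `skip_connected_stack`, the residual MLP standing in for a real attention block, the concatenate-then-project fusion of skip features, and the choice to inject the conditioning vector as an extra prepended token.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim):
    """Hypothetical stand-in for a Transformer block: a residual per-token MLP."""
    w = rng.normal(scale=0.02, size=(dim, dim))
    return lambda x: x + np.tanh(x @ w)

def skip_connected_stack(tokens, cond, depth=5):
    """Run latent tokens through a stack with long skip connections.

    tokens: (T, D) array of latent frames.
    cond:   (D,) conditioning embedding (assumed here to be a CLAP-style vector).
    depth:  odd number of blocks: `half` shallow, one middle, `half` deep.
    """
    assert depth % 2 == 1, "depth must be odd so shallow and deep blocks pair up"
    dim = tokens.shape[1]
    half = depth // 2
    blocks = [make_block(dim) for _ in range(depth)]
    # One projection per deep block, mapping [x, skip] (2D wide) back to D.
    fuse = [rng.normal(scale=0.02, size=(2 * dim, dim)) for _ in range(half)]

    # Assumption: conditioning enters as one extra token in front of the latents.
    x = np.vstack([cond[None, :], tokens])

    skips = []
    for blk in blocks[:half]:          # shallow half: remember each output
        x = blk(x)
        skips.append(x)
    x = blocks[half](x)                # middle block
    for blk, w in zip(blocks[half + 1:], fuse):
        # Deep half: fuse with the matching shallow output (last in, first out).
        x = np.concatenate([x, skips.pop()], axis=1) @ w
        x = blk(x)
    return x[1:]                       # drop the conditioning token

latents = np.ones((8, 16))             # 8 latent frames, 16 channels (toy sizes)
target = np.ones(16)                   # toy conditioning vector
print(skip_connected_stack(latents, target).shape)
```

The last-in, first-out pairing gives each deep block direct access to features from its mirrored shallow block, which is the role U-Net skip connections play in earlier diffusion backbones.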
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Linear Layer · Multi-Head Attention · Convolution · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Attention Is All You Need
