Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
Jianyuan Feng, Guangzheng Li, Yangfei Xu

TL;DR
HybridSep is a novel two-stage language-queried audio separation framework that combines pre-trained SSL audio models with CLAP-based semantic embeddings, using adversarial diffusion training to improve separation accuracy and set new benchmarks.
Contribution
The paper introduces HybridSep, which integrates pre-trained SSL audio models and CLAP embeddings with adversarial diffusion training to improve language-queried audio separation.
Findings
Significant performance improvements over state-of-the-art methods.
Establishment of new benchmarks for language-queried audio separation.
Effective use of adversarial diffusion training to enhance separation fidelity.
Abstract
Language-queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretraining (CLAP) frameworks, capable of extracting cross-modal audio-text relationships, remains underexplored. To address this, we present HybridSep, a two-stage LASS framework that synergizes SSL-based acoustic representations with CLAP-derived semantic embeddings. Our framework introduces Adversarial Consistent Training (ACT), a novel optimization strategy that treats diffusion as an auxiliary…
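To make the idea of pairing acoustic representations with a query embedding concrete, here is a minimal sketch of one generic way a text embedding can condition a separation mask (FiLM-style scaling and shifting). It is not HybridSep's architecture, which the truncated abstract does not specify. The function name `film_mask_separate`, all dimensions, and the random weights are hypothetical. Random arrays stand in for the outputs of a real SSL audio model and a real CLAP text encoder.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def film_mask_separate(mix_spec, acoustic_feats, query_emb, w_gamma, w_beta, w_mask):
    """Illustrative query-conditioned masking (an assumption, not the paper's method).

    mix_spec:       (T, F) magnitude spectrogram of the mixture
    acoustic_feats: (T, D_a) frame-level features (stand-in for SSL model output)
    query_emb:      (D_t,) query embedding (stand-in for a CLAP text embedding)
    """
    # Project the query into per-channel scale and shift terms (FiLM).
    gamma = w_gamma @ query_emb                            # (D_a,)
    beta = w_beta @ query_emb                              # (D_a,)
    conditioned = acoustic_feats * (1.0 + gamma) + beta    # (T, D_a)

    # Map conditioned features to a soft mask in (0, 1) per time-frequency bin.
    mask = sigmoid(conditioned @ w_mask)                   # (T, F)
    return mask * mix_spec, mask


# Hypothetical sizes and random placeholders for real model outputs and weights.
rng = np.random.default_rng(0)
T, F, D_A, D_T = 50, 257, 64, 32
mix_spec = np.abs(rng.normal(size=(T, F)))
acoustic_feats = rng.normal(size=(T, D_A))
query_emb = rng.normal(size=D_T)
w_gamma = rng.normal(scale=0.1, size=(D_A, D_T))
w_beta = rng.normal(scale=0.1, size=(D_A, D_T))
w_mask = rng.normal(scale=0.1, size=(D_A, F))

est_spec, mask = film_mask_separate(
    mix_spec, acoustic_feats, query_emb, w_gamma, w_beta, w_mask
)
print(est_spec.shape)
```

In a trained system the three weight matrices would be learned, and the estimated spectrogram would be inverted back to a waveform. The sketch only shows how a language query can steer which time-frequency bins are kept.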
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
