Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training

Jianyuan Feng; Guangzheng Li; Yangfei Xu

arXiv:2506.16833 · cs.SD · June 23, 2025

Open Access

TL;DR

HybridSep is a two-stage language-queried audio source separation framework that combines pre-trained self-supervised learning (SSL) audio models with CLAP-based semantic embeddings. It uses adversarial diffusion training to improve separation accuracy, and the authors report improvements over prior state-of-the-art methods.

Contribution

The paper introduces HybridSep, which integrates SSL-based acoustic representations and CLAP-derived semantic embeddings with adversarial diffusion training to improve language-queried audio separation.

Findings

01. Significant performance improvements over state-of-the-art methods.

02. Establishment of new benchmarks for language-queried audio separation.

03. Effective use of adversarial diffusion training to enhance separation fidelity.

Abstract

Language-queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretraining (CLAP) frameworks, capable of extracting cross-modal audio-text relationships, remains underexplored. To address this, we present HybridSep, a two-stage LASS framework that synergizes SSL-based acoustic representations with CLAP-derived semantic embeddings. Our framework introduces Adversarial Consistent Training (ACT), a novel optimization strategy that treats diffusion as an auxiliary…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis