Text-Queried Audio Source Separation via Hierarchical Modeling

Xinlei Yin; Xiulian Peng; Xue Jiang; Zhiwei Xiong; Yan Lu

arXiv:2505.21025 · cs.SD · December 3, 2025

TL;DR

This paper introduces a hierarchical framework for text-queried audio source separation that handles semantic alignment and structure-preserving reconstruction in separate stages, achieving state-of-the-art results with less training data than existing methods.

Contribution

The proposed HSM-TSS framework decouples semantic separation into global and local stages, improving efficiency and accuracy in text-guided audio source separation.

Findings

01 Achieves state-of-the-art separation performance.

02 Requires less training data than existing methods.

03 Maintains high semantic consistency in complex scenes.

Abstract

Target audio source separation with natural language queries presents a promising paradigm for extracting arbitrary audio events through arbitrary text descriptions. Existing methods mainly face two challenges: the difficulty of jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly learned single-stage architecture, and the reliance on large-scale, accurately labeled training data to compensate for inefficient cross-modal learning and separation. To address these challenges, we propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. Our approach introduces a dual-stage mechanism for semantic separation, operating on distinct global and local semantic feature spaces. We first perform global-semantic separation through a global…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: ALIGN