SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Daisuke Niizumi, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

TL;DR
This paper introduces a novel target sound extraction system that leverages a pre-trained audio foundation model, M2D, to improve sound identification and extraction across diverse sound types, especially when using enrollment clues.
Contribution
The paper proposes integrating the pre-trained M2D foundation model into SoundBeam, so that the target sound extraction system can build on M2D's feature representations instead of being trained entirely from scratch.
Findings
M2D integration improves extraction accuracy.
Performance gains are notable with enrollment clues.
The system generalizes well across various sound types.
Abstract
Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system requires solving two problems at once, identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound. The duality of the problem and the wide variety of sounds make it challenging to train a powerful TSE system from scratch. In this paper, to tackle this problem, we explore using a pre-trained audio foundation model that can provide rich feature representations of sounds within a TSE system. We chose the masked-modeling duo (M2D) foundation model, which appears especially suited for the TSE task, as it is trained using a dual objective consisting of sound-label predictions and improved masked prediction. These objectives are related to sound…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
