Universal Sound Separation with Self-Supervised Audio Masked Autoencoder
Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

TL;DR
This paper introduces a novel approach that integrates self-supervised audio masked autoencoder embeddings into universal sound separation models, significantly improving their performance on complex sound mixtures.
Contribution
The paper presents itself as the first to incorporate a self-supervised pre-trained model such as A-MAE into a universal sound separation (USS) system, and it reports improved separation performance as a result.
Findings
SSL embeddings improve separation accuracy
Freezing or updating A-MAE parameters affects performance
Enhanced results on AudioSet dataset
Abstract
Universal sound separation (USS) is a task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with the short-time Fourier transform (STFT) to serve as input features for the separation model. We evaluate our methods on the AudioSet dataset, and the…
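The feature-fusion step described in the abstract (SSL embeddings concatenated with the STFT) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`stft_magnitude`, `resample_frames`, `fuse_features`), the window and hop sizes, the linear interpolation used to align the two time axes, and the random array standing in for A-MAE output are all assumptions made for illustration; the abstract does not state how the embeddings are aligned with the STFT frames.

```python
import numpy as np


def stft_magnitude(wave, n_fft=256, hop=128):
    """Magnitude STFT with a Hann window; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def resample_frames(emb, n_frames):
    """Linearly interpolate embeddings (T_emb, D) along time to n_frames rows.

    Assumption: the SSL encoder runs at a different frame rate than the STFT,
    so the embeddings must be stretched before concatenation.
    """
    src = np.linspace(0.0, 1.0, emb.shape[0])
    dst = np.linspace(0.0, 1.0, n_frames)
    columns = [np.interp(dst, src, emb[:, d]) for d in range(emb.shape[1])]
    return np.stack(columns, axis=1)


def fuse_features(spec, emb):
    """Concatenate STFT features and time-aligned embeddings per frame."""
    aligned = resample_frames(emb, spec.shape[0])
    return np.concatenate([spec, aligned], axis=1)


rng = np.random.default_rng(0)
wave = rng.standard_normal(4096)       # stand-in for an audio mixture
emb = rng.standard_normal((10, 16))    # hypothetical placeholder for A-MAE embeddings

spec = stft_magnitude(wave)
fused = fuse_features(spec, emb)
print(spec.shape, fused.shape)
```

In this sketch the embeddings are a fixed array, which loosely mirrors the "frozen" strategy: the separation model only consumes them. Under the "updating" strategy the encoder producing `emb` would itself receive gradients during fine-tuning, which requires an autodiff framework and is not shown here.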
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
