Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Junqi Zhao; Xubo Liu; Jinzheng Zhao; Yi Yuan; Qiuqiang Kong; Mark D. Plumbley; Wenwu Wang

arXiv:2407.11745·eess.AS·November 7, 2024

Open Access

TL;DR

This paper introduces a novel approach that integrates self-supervised audio masked autoencoder embeddings into universal sound separation models, significantly improving their performance on complex sound mixtures.

Contribution

It is the first to incorporate self-supervised pre-trained models like A-MAE into USS systems, demonstrating enhanced separation capabilities.

Findings

01 SSL embeddings improve separation accuracy

02 Freezing or updating A-MAE parameters affects performance

03 Enhanced results on AudioSet dataset

Abstract

Universal sound separation (USS) is the task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with the short-time Fourier transform (STFT) to serve as input features for the separation model. We evaluate our methods on the AudioSet dataset, and the…
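
The feature construction described in the abstract, concatenating SSL embeddings with the STFT, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The window size, hop length, 16 kHz sample rate, 768-dimensional embedding, and nearest-neighbour time alignment are all assumptions made for illustration, and a random array stands in for the output of a real A-MAE encoder. The helper names `stft_magnitude`, `align_frames`, and `build_input_features` are hypothetical.

```python
import numpy as np


def stft_magnitude(wave, n_fft=512, hop=160):
    """Magnitude STFT with a Hann window; returns (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def align_frames(embeddings, n_frames):
    """Nearest-neighbour resampling of (T_ssl, D) embeddings to n_frames rows.

    Assumption: the SSL encoder runs at a different frame rate than the STFT,
    so the two sequences must be brought to a common length before joining.
    """
    idx = np.floor(np.arange(n_frames) * embeddings.shape[0] / n_frames)
    return embeddings[idx.astype(int)]


def build_input_features(wave, ssl_embeddings):
    """Concatenate STFT magnitudes and SSL embeddings along the feature axis."""
    spec = stft_magnitude(wave)
    emb = align_frames(ssl_embeddings, spec.shape[0])
    return np.concatenate([spec, emb], axis=1)


rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)                # one second at an assumed 16 kHz
ssl_embeddings = rng.standard_normal((50, 768))  # stand-in for A-MAE output
features = build_input_features(wave, ssl_embeddings)
print(features.shape)  # (97, 1025): 257 STFT bins + 768 embedding dimensions
```

In the frozen strategy, the embeddings would be computed once with fixed encoder weights. In the updating strategy, the encoder would be trained jointly with the separation model, which requires an autodiff framework rather than plain NumPy.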

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis