GASS: Generalizing Audio Source Separation with Large-scale Data

Jordi Pons; Xiaoyu Liu; Santiago Pascual; Joan Serrà

arXiv:2310.00140·cs.SD·October 3, 2023·1 cites

TL;DR

This paper introduces GASS, a large-scale supervised model for universal audio source separation that demonstrates strong generalization across diverse audio tasks and outperforms previous methods after fine-tuning.

Contribution

It presents a novel large-scale dataset and a unified GASS model capable of separating speech, music, and sound events, advancing universal source separation.

Findings

01 GASS achieves strong in-distribution separation results.

02 GASS generalizes well to out-of-distribution sound event and speech separation.

03 Fine-tuning GASS yields state-of-the-art performance on multiple benchmarks.

Abstract

Universal source separation aims to separate the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes made up predominantly of sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows their generalization abilities. Yet, it is challenging for GASS models to generalize to separating out-of-distribution cinematic…
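To make the supervised setup in the abstract concrete, here is a minimal sketch of how a training mix can be formed from known sources and how a separated estimate can be scored against its reference. It is illustrative only: the helper names `mix_sources` and `si_sdr` are hypothetical, and the choice of scale-invariant SDR as the score is an assumption (it is a common metric in separation work, but this page does not state which metrics or losses the paper uses).

```python
import math
import random


def mix_sources(sources):
    """Sum equal-length source signals sample by sample to form the mix."""
    return [sum(samples) for samples in zip(*sources)]


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    Assumed metric for illustration; not confirmed by the page as the
    paper's evaluation measure.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to get the target component.
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    # Everything left over after the projection counts as distortion.
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    random.seed(0)
    sr, n = 8000, 8000  # hypothetical sample rate and length (1 second)
    # Toy stand-ins for a speech, a music, and a sound-event source.
    speech = [math.sin(2 * math.pi * 220 * t / sr) for t in range(n)]
    music = [math.sin(2 * math.pi * 330 * t / sr) for t in range(n)]
    event = [random.uniform(-1.0, 1.0) for _ in range(n)]

    mix = mix_sources([speech, music, event])
    # Using the unprocessed mix as the "estimate" gives a baseline score;
    # a separation model is expected to improve on it.
    print("SI-SDR of mix vs. speech (dB):", si_sdr(mix, speech))
```

In a supervised pipeline of this kind, the mix is the model input and the individual sources are the training targets; the improvement of the model's score over the unprocessed-mix baseline is one way to summarize how much separation was achieved.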

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
