The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

Darius Petermann; Gordon Wichern; Zhong-Qiu Wang; Jonathan Le Roux

arXiv:2110.09958 · eess.AS · March 25, 2022

TL;DR

This paper introduces the cocktail fork problem: separating an audio mixture, such as a movie soundtrack, into speech, music, and sound effects. It also provides a new dataset and model benchmarks for this underexplored task.

Contribution

The paper formalizes the cocktail fork problem, creates the Divide and Remaster (DnR) dataset, and proposes a multi-resolution model for improved three-source audio separation.

Findings

01. The best model achieves an SI-SDR improvement of roughly 11 dB over the input mixture for each of the three sources

02. Benchmark results establish baseline performance for the task

03. The DnR dataset enables future research in three-source separation
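
The SI-SDR improvement quoted in the findings is the scale-invariant signal-to-distortion ratio of a separated stem minus that of the unprocessed mixture, both measured against the same reference stem. The block below is a minimal sketch of that metric in pure Python, not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the mean removal step, the `eps` stabilizer, and the toy sine-wave signals are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; signals are plain lists of floats.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Assumption: remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    # Energy of the scaled target versus energy of everything left over.
    target_energy = sum((alpha * r) ** 2 for r in ref)
    noise_energy = sum((e - alpha * r) ** 2 for e, r in zip(est, ref))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (assumed, not from DnR): one target stem plus one interferer.
n = 1000
reference = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interferer = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
mixture = [r + i for r, i in zip(reference, interferer)]
# A pretend separator output that leaves 10% of the interferer amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, reference, mixture)
print(f"SI-SDR improvement: {improvement:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric suits separation systems whose output gain is arbitrary. In a three-stem setting the same computation would be repeated once per stem (speech, music, sound effects), each time against that stem's own reference.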

Abstract

The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad categories of speech, music, and sound effects (understood to include ambient noise and natural sound events) has been left largely unexplored, despite a wide range of potential applications. This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster (DnR) dataset to foster research on this topic. DnR is built from three well-established audio datasets (LibriSpeech, FMA, FSD50k), taking care to reproduce conditions similar to professionally produced content…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Blind Source Separation Techniques