Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks
Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux

TL;DR
This paper addresses the cocktail fork problem by separating audio into speech, music, and sound effects, benchmarking deep learning models, and exploring how remixing separated sources can enhance transcription accuracy.
Contribution
It introduces a three-pronged source separation approach for complex audio scenes and evaluates how remixing separated sources improves downstream transcription tasks.
Findings
Source separation improves transcription performance over original soundtracks.
Remixing sources at 17.5 dB SNR reduces word error rate in speech recognition.
Remixing enhances tagging accuracy for music and sound effects.
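To make the remixing finding concrete, here is a minimal sketch of one plausible reading: a stem of interest (for example, the separated speech) is summed with the remaining material after that material is scaled to sit a chosen number of dB below it. The function name `remix_at_snr`, its arguments, and the power-based SNR definition are assumptions for illustration; the summary above does not spell out the paper's exact remixing procedure.

```python
import numpy as np

def remix_at_snr(target, background, snr_db):
    """Sum `target` with `background`, scaling `background` so that the
    target-to-background power ratio equals `snr_db` (hypothetical helper)."""
    target = np.asarray(target, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)
    p_target = np.mean(target ** 2)
    p_background = np.mean(background ** 2)
    if p_background == 0.0:
        # Nothing to attenuate: the remix is just the target.
        return target.copy(), 0.0
    # Solve 10*log10(p_target / (gain**2 * p_background)) = snr_db for gain.
    gain = np.sqrt(p_target / (p_background * 10.0 ** (snr_db / 10.0)))
    return target + gain * background, gain
```

In this sketch, calling `remix_at_snr(speech_estimate, music_estimate + sfx_estimate, 17.5)` would return a mixture in which speech dominates the other two stems by 17.5 dB, along with the gain that was applied; the stem variable names are placeholders.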
Abstract
Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX - understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that…
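As a reminder of what the SDR measure mentioned in the abstract computes, the sketch below implements the basic definition: the ratio, in dB, of reference energy to the energy of the estimation error. The function name `sdr_db` and the `eps` stabilizer are assumptions for illustration, and the excerpt does not say which SDR variant (for example, a scale-invariant one) the authors report.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB (illustrative, not the paper's code)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
```

In a three-stem setting, this would be evaluated once per stem (speech, music, SFX) against the corresponding ground-truth signal; higher values indicate less distortion.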
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
