PodcastMix: A dataset for separating music and speech in podcasts
Nicolás Schmidt, Jordi Pons, Marius Miron

TL;DR
PodcastMix provides a new dataset and benchmark for separating music and speech in podcasts, highlighting current deep learning models' generalization challenges and demonstrating promising separation quality.
Contribution
We introduce PodcastMix, a large dataset and benchmark for music and speech separation in podcasts, including synthetic training data and real podcast evaluation sets.
Findings
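Current deep learning models may have generalization issues on real podcasts, especially when trained on synthetic data.
The best baseline separates speech with a mean opinion score of 3.84 for "overall separation quality" (rated from 1 to 5).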
Dataset and baselines are publicly available.
Abstract
We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. We aim to define a benchmark suitable for training and evaluating (deep learning) source separation models. To that end, we release a large and diverse training dataset based on programmatically generated podcasts. However, current (deep learning) models can run into generalization issues, especially when trained on synthetic data. To target potential generalization issues, we release an evaluation set based on real podcasts, for which we design objective and subjective tests. From our experiments with real podcasts, we find that current (deep learning) models may have generalization issues. Yet, they can perform competently, e.g., our best baseline separates speech with a mean opinion score of 3.84 (rating "overall separation quality" from 1 to 5). The dataset…
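To make "programmatically generated podcasts" concrete, here is a minimal sketch of how a synthetic training example could be built by overlaying background music on foreground speech. This is an illustration under stated assumptions, not the authors' pipeline: the function name `mix_podcast`, the `music_gain` parameter, and the peak-rescaling step are all hypothetical, and the abstract does not describe the actual mixing procedure.

```python
import random


def mix_podcast(speech, music, music_gain):
    """Overlay background music on foreground speech, sample by sample.

    Hypothetical helper for illustration only. Signals are lists of floats
    in [-1, 1] at the same sample rate. Returns (mixture, speech_target,
    music_target), where mixture == speech_target + music_target.
    """
    n = min(len(speech), len(music))  # truncate to the shorter signal
    scaled_music = [music_gain * music[i] for i in range(n)]
    mixture = [speech[i] + scaled_music[i] for i in range(n)]

    # Rescale all three signals together only if the sum would clip,
    # so the targets still add up to the mixture.
    peak = max((abs(x) for x in mixture), default=0.0)
    scale = 1.0 / peak if peak > 1.0 else 1.0
    return (
        [x * scale for x in mixture],
        [speech[i] * scale for i in range(n)],
        [x * scale for x in scaled_music],
    )


# Usage: draw a random background level (the range is an assumption).
rng = random.Random(0)
gain = rng.uniform(0.1, 0.5)
speech = [0.5, -0.5, 0.8, 0.0]
music = [1.0, 1.0, 1.0, 1.0, 1.0]
mixture, speech_target, music_target = mix_podcast(speech, music, gain)
```

A separation model would take `mixture` as input and be trained to recover `speech_target` and `music_target`. Because both sources are known for synthetic mixtures, objective metrics can be computed directly, whereas real podcasts without isolated sources call for subjective listening tests such as the one reported above.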
Taxonomy
Topics: Radio, Podcasts, and Digital Media · Music and Audio Processing · Speech and Audio Processing
