A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation
Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler

TL;DR
This study compares state-of-the-art music source separation DNNs for dialog separation in broadcast audio, demonstrating that transfer learning and fine-tuning improve their performance to approach task-specific models.
Contribution
It provides a practical evaluation of transfer learning from music source separation models to dialog separation, highlighting their potential and limitations.
Findings
Pre-trained models can be adapted for dialog separation.
Fine-tuning improves model performance significantly.
Models reach near task-specific performance after fine-tuning.
Abstract
This paper describes a hands-on comparison of state-of-the-art music source separation deep neural networks (DNNs), before and after task-specific fine-tuning, for separating speech content from non-speech content in broadcast audio (i.e., dialog separation). The music separation models were selected because they share the number of channels (2) and the sampling rate (44.1 kHz or higher) with the considered broadcast content, and because vocals separation in music is considered a parallel to dialog separation in the target application domain. These similarities are assumed to enable transfer learning between the tasks. Three models pre-trained on music (Open-Unmix, Spleeter, and Conv-TasNet) are considered in the experiments and fine-tuned with real broadcast data. The performance of the models is evaluated before and after fine-tuning with computational evaluation metrics (SI-SIRi, SI-SDRi,…
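One of the metrics named in the abstract, SI-SDRi, is the improvement in scale-invariant signal-to-distortion ratio of the separated dialog over the unprocessed mixture. The following is a minimal sketch of the commonly used SI-SDR definition, not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement` are illustrative, signals are assumed to be mono, equal-length sequences of samples (a stereo signal would have to be handled per channel or flattened first), and the small `eps` guard is an added assumption to avoid division by zero.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimate and a reference signal.

    Both arguments are equal-length sequences of float samples (assumed mono).
    """
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-target part and a residual part.
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the separated output minus SI-SDR of the input mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Toy signals: the "dialog" and the "background" are orthogonal by construction.
    dialog = [1.0, 0.0, 1.0, 0.0]
    background = [0.0, 1.0, 0.0, 1.0]
    mixture = [d + b for d, b in zip(dialog, background)]
    # Hypothetical separator output that attenuates the background by a factor of 10.
    estimate = [d + 0.1 * b for d, b in zip(dialog, background)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, dialog):.2f} dB")
```

Because the reference is rescaled before the residual is computed, multiplying the estimate by a constant gain leaves the score unchanged, which is what makes the metric scale-invariant. SI-SIRi additionally requires the interfering signal as a reference and is not covered by this sketch.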
