Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, John R. Hershey

TL;DR
This paper demonstrates that unsupervised MixIT training can effectively adapt speech separation models to real-world, noisy, and reverberant meeting recordings, improving performance over generalist models.
Contribution
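It shows how to adapt speech separation models to real-world data using MixIT, without requiring ground-truth references. The adapted models are evaluated objectively on a synthetic AMI test set and subjectively on real recordings with MUSHIRA, a modification of the standard MUSHRA listening protocol that handles imperfect reference signals.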
Findings
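With network architecture held constant, the fine-tuned semi-supervised model yields the largest SI-SNR improvement and PESQ scores.
Adapted models outperform unadapted generalist models on real AMI recordings.
The proposed MUSHIRA protocol supports human evaluation on real recordings where only imperfect reference signals are available.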
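For readers unfamiliar with the objective metric named above, the following is a minimal sketch of scale-invariant SNR (SI-SNR) and SI-SNR improvement over the unprocessed mixture. It uses one common definition with mean removal. The function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for illustration, and the paper's exact variant may differ.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between 1-D signals of equal length."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the best scaling.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) /
                           (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate, reference, mixture, eps=1e-8):
    """SI-SNR gain of the separated estimate over the input mixture."""
    return si_snr(estimate, reference, eps) - si_snr(mixture, reference, eps)
```

Because the reference is rescaled by projection, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is why the metric is called scale-invariant. It still requires a reference signal, which is presumably why a synthetic test set is needed for objective evaluation.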
Abstract
The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening…
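The core idea of MixIT can be sketched as follows: two reference mixtures are summed into a mixture of mixtures, the model separates that sum into several estimated sources, and the loss searches over all ways of assigning each estimated source to one of the two reference mixtures, keeping the best match. The sketch below is illustrative only. The negative-SNR signal loss, the function names `neg_snr` and `mixit_loss`, and the array shapes are assumptions, not the authors' implementation.

```python
import itertools
import numpy as np

def neg_snr(reference, estimate, eps=1e-8):
    """Negative SNR in dB (lower is better); an assumed signal-level loss."""
    err = reference - estimate
    return -10.0 * np.log10((np.sum(reference ** 2) + eps) /
                            (np.sum(err ** 2) + eps))

def mixit_loss(mixtures, est_sources):
    """Brute-force MixIT loss.

    mixtures:    array of shape (2, T), the two reference mixtures.
    est_sources: array of shape (M, T), model outputs for mixtures.sum(0).
    Returns the best loss and the source-to-mixture assignment achieving it.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Each estimated source is assigned to exactly one reference mixture.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ est_sources  # shape (n_mix, T)
        loss = sum(neg_snr(mixtures[i], remixed[i]) for i in range(n_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign
```

No isolated source references appear anywhere in the loss; only the two mixtures are needed, which is what allows training on real recordings. The exhaustive search visits 2^M assignments, which is practical only for a small number of output sources M.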
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
