Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

Aswin Sivaraman; Scott Wisdom; Hakan Erdogan; John R. Hershey

arXiv:2110.10739·cs.SD·October 22, 2021


Open Access

TL;DR

This paper demonstrates that unsupervised MixIT training can effectively adapt speech separation models to real-world, noisy, and reverberant meeting recordings, improving performance over generalist models.

Contribution

It uses MixIT to adapt speech separation models to real-world meeting data without requiring ground-truth references. The adapted models are evaluated objectively on a synthetic AMI test set and subjectively by human listeners using MUSHIRA, a modification of the standard MUSHRA protocol.

Findings

01

With network architectures held constant, a fine-tuned semi-supervised model yields the largest SI-SNR improvement and PESQ scores.

02

MixIT-adapted models outperform unadapted generalist models on real AMI recordings.

03

The proposed MUSHIRA protocol modifies MUSHRA so that human listeners can evaluate models on real recordings whose reference signals are imperfect.

Abstract

The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening…
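The abstract describes MixIT as training without isolated reference sources. A minimal sketch of that idea follows, based on the general MixIT formulation rather than on code from this paper: two reference mixtures are summed, a model (not shown) separates the sum into several estimated sources, and the loss is the best score over every way of assigning each estimated source to one of the two mixtures. The function names `neg_snr` and `mixit_loss` are hypothetical, and the plain negative SNR used as the per-mixture loss is an assumption made for illustration.

```python
import itertools
import math


def neg_snr(ref, est, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better). Assumed loss."""
    signal = sum(r * r for r in ref)
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    return -10.0 * math.log10((signal + eps) / (noise + eps))


def mixit_loss(mix1, mix2, est_sources):
    """Return (best_loss, best_assignment) over all source-to-mixture assignments.

    est_sources stands in for the model output on mix1 + mix2.
    best_assignment[j] is 0 if source j is remixed into mix1, 1 for mix2.
    """
    n = len(mix1)
    best_loss, best_assign = float("inf"), None
    # Each estimated source goes to exactly one of the two mixtures.
    for assign in itertools.product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for src, which in zip(est_sources, assign):
            for t, value in enumerate(src):
                remix[which][t] += value
        loss = neg_snr(mix1, remix[0]) + neg_snr(mix2, remix[1])
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy example: three estimated sources that exactly partition the two mixtures.
mix1 = [1.0, 0.0, -1.0, 0.0]
mix2 = [0.0, 2.0, 0.0, -2.0]
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, -2.0], [0.0, 0.0, -1.0, 0.0]]
loss, assignment = mixit_loss(mix1, mix2, sources)
```

Only the two mixtures are used as training targets, which is why recordings such as real meeting audio can serve as training data. The exhaustive search grows as 2^M in the number of estimated sources M, so this sketch is only practical for small M.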


Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing