Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

TL;DR
This paper enhances speech separation for meeting transcription by integrating TF-GridNet with a mixture encoder capable of handling multiple speakers and overlaps, achieving state-of-the-art results on LibriCSS.
Contribution
It extends the mixture encoder to natural meeting scenarios with multiple speakers and varying overlaps, and evaluates its integration with TF-GridNet for improved separation.
Findings
Achieved new state-of-the-art on LibriCSS with a single microphone.
TF-GridNet significantly reduces the gap to oracle separation.
Further potential for improvement remains.
Abstract
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between…
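The separate-then-recognize data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `transcribe_meeting`, `toy_separate`, and `toy_recognize` are hypothetical names, and the toy functions only stand in for a real separator (such as TF-GridNet) and a real ASR model. The one structural point it shows is that the recognizer receives the original mixture alongside each separated stream, which is the kind of interface a mixture encoder would need.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def transcribe_meeting(
    mixture: Signal,
    separate: Callable[[Signal], Sequence[Signal]],
    recognize: Callable[[Signal, Signal], str],
) -> List[str]:
    """Separate first, then run recognition on every output stream.

    `recognize` gets the separated stream AND the original mixture, so a
    recognizer with a mixture encoder could consult both inputs.
    """
    streams = separate(mixture)
    return [recognize(stream, mixture) for stream in streams]


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_separate(mixture: Signal) -> List[Signal]:
    # Pretend positive samples belong to one stream and negative ones to the other.
    positives = [x if x > 0 else 0.0 for x in mixture]
    negatives = [x if x < 0 else 0.0 for x in mixture]
    return [positives, negatives]


def toy_recognize(stream: Signal, mixture: Signal) -> str:
    # A real recognizer would return a transcript; this reports stream activity.
    active = sum(1 for x in stream if x != 0.0)
    return f"{active}/{len(mixture)} samples active"


hypotheses = transcribe_meeting([0.5, -0.2, 0.0, 0.3], toy_separate, toy_recognize)
```

Passing the separator and recognizer as callables keeps the sketch agnostic about which models are used, which mirrors the paper's comparison of separators of varying strength.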
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
