Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

Peter Vieting; Simon Berger; Thilo von Neumann; Christoph Boeddeker; Ralf Schlüter; Reinhold Haeb-Umbach

arXiv:2309.08454·eess.AS·February 27, 2025

Open Access

TL;DR

This paper enhances speech separation for meeting transcription by integrating TF-GridNet with a mixture encoder capable of handling multiple speakers and overlaps, achieving state-of-the-art results on LibriCSS.

Contribution

It extends the mixture encoder to natural meeting scenarios with multiple speakers and varying overlaps, and evaluates its integration with TF-GridNet for improved separation.

Findings

01

Achieves a new state-of-the-art result on LibriCSS with a single microphone.

02

TF-GridNet significantly reduces the gap to oracle separation.

03

Further potential for improvement remains.

Abstract

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common approach first separates the speech into overlap-free streams and then performs ASR on each stream. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments result in a new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing