Loose coupling of spectral and spatial models for multi-channel diarization and enhancement of meetings in dynamic environments
Adrian Meise, Tobias Cord-Landwehr, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

TL;DR
This paper introduces a novel joint spatial and spectral mixture model for multi-channel diarization and enhancement, effectively handling speaker movement in dynamic environments by loosely coupling the two models.
Contribution
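It proposes a joint spatial and spectral mixture model whose two submodels are loosely coupled: the relationship between speaker index and position index is modeled probabilistically, so a speaker is not tied to a single position. This allows diarization and enhancement in meetings where speakers move.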
Findings
Large improvements over tightly coupled subsystems on LibriCSS data with simulated speaker position changes.
Effective handling of speaker position changes during meetings.
Joint use of spatial and spectral information for diarization and enhancement, two important tasks in multi-channel meeting transcription.
Abstract
Sound capture by microphone arrays opens the possibility to exploit spatial, in addition to spectral, information for diarization and signal enhancement, two important tasks in meeting transcription. However, there is no one-to-one mapping of positions in space to speakers if speakers move. Here, we address this by proposing a novel joint spatial and spectral mixture model, whose two submodels are loosely coupled by modeling the relationship between speaker and position index probabilistically. Thus, spatial and spectral information can be jointly exploited, while at the same time allowing for speakers speaking from different positions. Experiments on the LibriCSS data set with simulated speaker position changes show great improvements over tightly coupled subsystems.
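To illustrate what a probabilistic relationship between speaker and position index can look like, here is a minimal sketch. It is not the authors' model: the function name `joint_posteriors`, the `coupling` matrix, and all toy numbers are assumptions made for illustration. The sketch treats `coupling[k, p]` as a joint prior over speaker `k` and position `p`, multiplies it with per-frame spectral and spatial likelihoods, and marginalizes to obtain speaker and position posteriors. A tightly coupled system would correspond to a coupling matrix that allows only one position per speaker.

```python
import numpy as np

def joint_posteriors(spectral_lik, spatial_lik, coupling):
    """Combine spectral and spatial evidence through a speaker-position prior.

    spectral_lik: (T, K) likelihood of frame t under speaker k's spectral model
    spatial_lik:  (T, P) likelihood of frame t under position p's spatial model
    coupling:     (K, P) joint prior over (speaker, position); hypothetical here
    Returns per-frame posteriors over speakers (T, K) and positions (T, P).
    """
    # Unnormalized joint posterior over (speaker, position) for every frame.
    joint = coupling[None, :, :] * spectral_lik[:, :, None] * spatial_lik[:, None, :]
    joint = joint / joint.sum(axis=(1, 2), keepdims=True)
    speaker_post = joint.sum(axis=2)   # marginalize over positions
    position_post = joint.sum(axis=1)  # marginalize over speakers
    return speaker_post, position_post

# Toy setup (assumed values): 2 speakers, 3 positions, 3 frames.
# Speaker 0 may talk from position 0 or 2; speaker 1 only from position 1.
coupling = np.array([[0.25, 0.0, 0.25],
                     [0.0,  0.5, 0.0]])
# Spectral evidence is deliberately uninformative to isolate the spatial effect.
spectral_lik = np.full((3, 2), 0.5)
# Each frame is dominated by a different position.
spatial_lik = np.array([[0.90, 0.05, 0.05],
                        [0.05, 0.90, 0.05],
                        [0.05, 0.05, 0.90]])

speaker_post, position_post = joint_posteriors(spectral_lik, spatial_lik, coupling)
print(speaker_post.argmax(axis=1))
```

In this toy case, frames dominated by positions 0 and 2 are both attributed to speaker 0, because the coupling prior allows that speaker at either position. In a real system the coupling would be estimated from data rather than fixed by hand.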
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
