Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings
Taous Iatariene, Alexandre Guérin, Romain Serizel (MULTISPEECH)

TL;DR
This paper introduces a knowledge distillation approach for extracting short-context speaker embeddings to improve low-latency multi-speaker tracking, especially in overlapping speech scenarios.
Contribution
It proposes a novel KD-based training method for short-context speaker embeddings and explores blockwise identity reassignment for low-latency tracking.
Findings
Distilled models effectively extract short-context embeddings.
Models show increased robustness to overlapping speech.
Blockwise reassignment offers a promising low-latency tracking approach.
Abstract
Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e., by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which forces identity reassignment to operate over the long term in order to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn negatively impacts identity reassignment. To address this, we propose a Knowledge Distillation (KD) based training approach for short-context speaker embedding extraction from two-speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity…
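To make the idea of identity reassignment over fixed-size blocks concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say how embeddings are matched to identities, so the reference (enrolled) embeddings, the cosine-similarity score, the per-block averaging, and the function name `blockwise_reassign` are all assumptions made for illustration.

```python
from itertools import permutations

import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def blockwise_reassign(track_emb, ref_emb, block_size):
    """Hypothetical blockwise identity reassignment (illustrative assumption).

    track_emb: array (n_tracks, n_frames, dim), one embedding per track and frame.
    ref_emb:   array (n_tracks, dim), one assumed reference embedding per speaker.
    Returns one tuple per block; entry i is the speaker index assigned to track i.
    """
    n_tracks, n_frames, _ = track_emb.shape
    assignments = []
    for start in range(0, n_frames, block_size):
        # Average the embeddings of each track inside the current block.
        block = track_emb[:, start:start + block_size].mean(axis=1)
        # Pick the track-to-speaker permutation with the highest total similarity.
        best = max(
            permutations(range(n_tracks)),
            key=lambda p: sum(cosine(block[i], ref_emb[p[i]]) for i in range(n_tracks)),
        )
        assignments.append(best)
    return assignments


# Toy example: the tracker swaps the two speakers halfway through.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
tracks = np.zeros((2, 8, 2))
tracks[0, :4] = ref[0]
tracks[0, 4:] = ref[1]
tracks[1, :4] = ref[1]
tracks[1, 4:] = ref[0]
result = blockwise_reassign(tracks, ref, block_size=4)
```

In this toy case the second block is mapped to the swapped permutation, which is the kind of correction reassignment is meant to provide. A smaller `block_size` lowers latency but gives each decision less temporal context, which is why embeddings that remain reliable over short contexts matter.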
Taxonomy
Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Speech Recognition and Synthesis
