Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings

Taous Iatariene; Alexandre Guérin; Romain Serizel (MULTISPEECH)

arXiv:2508.14115·eess.AS·August 21, 2025

Open Access

TL;DR

This paper introduces a knowledge distillation approach for extracting short-context speaker embeddings to improve low-latency multi-speaker tracking, especially in overlapping speech scenarios.

Contribution

It proposes a novel KD-based training method for short-context speaker embeddings and explores blockwise identity reassignment for low-latency tracking.

Findings

01. Distilled models effectively extract short-context embeddings.

02. Models show increased robustness to overlapping speech.

03. Blockwise reassignment offers a promising low-latency tracking approach.

Abstract

Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e., by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which forces identity reassignment to operate over long time spans in order to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn negatively impacts identity reassignment. To address this, we propose a training approach based on Knowledge Distillation (KD) for short-context speaker embedding extraction from two-speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity…
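To make the idea of reassigning identities over fixed-size blocks more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each block yields one short-context embedding per track and that a reference embedding per known speaker is available (for example from enrollment). The function names `cosine`, `reassign_block`, and `blockwise_reassignment`, the cosine-similarity score, and the exhaustive permutation search are all illustrative assumptions that do not come from the paper.

```python
from itertools import permutations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reassign_block(track_embeddings, reference_embeddings):
    """Assign each track in one block to a distinct reference speaker.

    Hypothetical helper: returns a tuple `perm` where perm[i] is the index of
    the reference speaker assigned to track i, chosen to maximize the summed
    cosine similarity. Exhaustive search is only practical for a few speakers.
    """
    n_tracks = len(track_embeddings)
    if n_tracks > len(reference_embeddings):
        raise ValueError("more tracks than reference speakers")
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(len(reference_embeddings)), n_tracks):
        score = sum(
            cosine(track_embeddings[i], reference_embeddings[j])
            for i, j in enumerate(perm)
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm


def blockwise_reassignment(blocks, reference_embeddings):
    """Apply `reassign_block` independently to every block (assumed design)."""
    return [reassign_block(block, reference_embeddings) for block in blocks]


# Toy data (made up): two reference speakers, two blocks of two tracks each.
# In the second block the tracker has swapped the two speakers.
references = [[1.0, 0.0], [0.0, 1.0]]
blocks = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.3]],
]
assignments = blockwise_reassignment(blocks, references)
```

In this toy example the second block comes out with the opposite assignment to the first, which is the kind of identity swap that reassignment is meant to correct. Because each decision only needs the embeddings from one block, the added latency is bounded by the block length, which is why embeddings that stay reliable on short contexts matter for this scheme.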

Taxonomy

Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Speech Recognition and Synthesis