DNN Speaker Tracking with Embeddings
Carlos Rodrigo Castillo-Sanchez, Leibny Paola Garcia-Perera, Anabel Martin-Gonzalez

TL;DR
This paper introduces a novel embedding-based neural network method for online speaker tracking that significantly improves diarization accuracy over traditional PLDA-based systems, demonstrating robustness across datasets and conditions.
Contribution
The paper presents a new CNN-based speaker tracking approach that mimics PLDA classifiers, offering improved performance and robustness in multi-speaker scenarios.
Findings
17% DER improvement on DIHARD II dataset
Effective in overlapping and non-overlapping speech segments
Robust against non-target speaker interference
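DER here is the diarization error rate, conventionally the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time. The following is a minimal frame-level sketch of that metric, not the paper's scoring setup. It assumes fixed-length frames, speaker identities that are already known (as in tracking with enrolled speakers, so no label mapping is needed), and no scoring collar. The function name and the toy labels are illustrative.

```python
from typing import List, Set


def frame_level_der(reference: List[Set[str]], hypothesis: List[Set[str]]) -> float:
    """Diarization error rate over equal-length frames.

    Each frame is the set of speakers active in it, so overlapping
    speech is a set with more than one element.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total_ref = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        total_ref += n_ref
        # Reference speakers with no hypothesis slot at all.
        missed += max(0, n_ref - n_hyp)
        # Hypothesis speakers beyond the number actually speaking.
        false_alarm += max(0, n_hyp - n_ref)
        # Slots that were filled, but with the wrong speaker.
        confusion += min(n_ref, n_hyp) - n_correct

    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref


# Toy example: six frames, two speakers, one overlapping frame.
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set(), {"B"}]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"A"}, set()]
print(f"DER = {frame_level_der(ref, hyp):.3f}")
```

In the toy example there are two missed speaker-frames, one false alarm, and one confusion against six reference speaker-frames, so the DER is 4/6.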
Abstract
In multi-speaker applications it is common to have pre-computed models from enrolled speakers. Using these models to identify the instances in which these speakers intervene in a recording is the task of speaker tracking. In this paper, we propose a novel embedding-based speaker tracking method. Specifically, our design is based on a convolutional neural network that mimics a typical speaker verification PLDA (probabilistic linear discriminant analysis) classifier and finds the regions uttered by the target speakers in an online fashion. The system was studied from two different perspectives: diarization and tracking; results on both show a significant improvement over the PLDA baseline under the same experimental conditions. Two standard public datasets, CALLHOME and DIHARD II single channel, were modified to create two-speaker subsets with overlapping and non-overlapping regions. We…
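The tracking task in the abstract (scoring incoming audio against pre-computed models of enrolled speakers, one segment at a time) can be illustrated with a small sketch. This is not the paper's method: plain cosine similarity stands in for both the CNN scorer and the PLDA baseline. The embeddings, the speaker names, the `track_online` function, and the 0.6 threshold are all hypothetical.

```python
import math
from typing import Dict, Iterable, Iterator, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def track_online(
    segment_embeddings: Iterable[List[float]],
    enrolled: Dict[str, List[float]],
    threshold: float = 0.6,  # hypothetical value, would be tuned on held-out data
) -> Iterator[Set[str]]:
    """Yield, for each incoming segment, the enrolled speakers detected in it.

    Segments are consumed one at a time, so a decision is available as soon
    as a segment arrives. More than one speaker may pass the threshold
    (overlapping speech), and none may pass (non-target speaker or silence).
    """
    for embedding in segment_embeddings:
        yield {
            name
            for name, model in enrolled.items()
            if cosine(embedding, model) >= threshold
        }


# Toy 2-D embeddings; real speaker embeddings have hundreds of dimensions.
enrolled_models = {"spk1": [1.0, 0.0], "spk2": [0.0, 1.0]}
stream = [[0.9, 0.1], [0.1, 0.95], [1.0, 1.0], [-1.0, -1.0]]
for index, active in enumerate(track_online(stream, enrolled_models)):
    print(index, sorted(active))
```

Scoring each enrolled speaker independently, rather than picking the single best match, is what lets the sketch report both speakers in an overlapped segment and no speaker when only a non-target voice is present.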
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
