TL;DR
This paper provides a comprehensive, step-by-step tutorial explaining the open-source DiariZen speaker diarization pipeline, detailing each component and offering source code and visualizations for better understanding and reproducibility.
Contribution
It offers a self-contained, detailed walkthrough of the DiariZen pipeline, making it easier for researchers to understand, reproduce, and extend this state-of-the-art system.
Findings
DiariZen achieves leading open-source performance across multiple benchmarks at the time of writing.
The tutorial includes source code, intermediate visualizations, and end-to-end execution scripts.
The pipeline integrates WavLM, Conformer, and VBx clustering for speaker diarization.
Abstract
Speaker diarization (SD) is the task of answering "who spoke when" in a multi-speaker audio stream. Classically, an SD system clusters segments of speech belonging to an individual speaker's identity. Recent years have seen substantial progress in SD through end-to-end neural diarization (EEND) approaches. DiariZen, a hybrid SD pipeline built upon a structurally pruned WavLM-Large encoder, a Conformer backend with powerset classification, and VBx clustering, represents the leading open-source state of the art at the time of writing across multiple benchmarks. Despite its strong performance, the DiariZen architecture spans several repositories and frameworks, making it difficult for researchers and practitioners to understand, reproduce, or extend the system as a whole. This tutorial paper provides a self-contained, block-by-block explanation of the complete DiariZen pipeline,…
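To make the "powerset classification" mentioned in the abstract more concrete, here is a minimal sketch of the general idea: each frame is assigned one class, and each class stands for a set of simultaneously active speakers. This is an illustration only, not code from the DiariZen repositories. The function names and the setting of 4 speakers with at most 2 overlapping are assumptions chosen for the example.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_overlap):
    """Enumerate speaker subsets of size 0..max_overlap, smallest first.

    Hypothetical helper for illustration; index 0 is the empty set (silence).
    """
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def decode_frames(class_ids, classes, num_speakers):
    """Turn per-frame class indices into per-speaker 0/1 activity rows."""
    rows = []
    for cid in class_ids:
        active = classes[cid]
        rows.append([1 if spk in active else 0 for spk in range(num_speakers)])
    return rows


# Assumed setting for the example: 4 local speakers, at most 2 overlapping.
classes = powerset_classes(num_speakers=4, max_overlap=2)

# Four frames: silence, speaker 0, speakers 0 and 1 overlapping, speaker 1.
activity = decode_frames([0, 1, 5, 2], classes, num_speakers=4)

for frame, row in enumerate(activity):
    print(frame, row)
```

With these assumed settings there are 1 + 4 + 6 = 11 classes, so the classifier makes a single 11-way decision per frame instead of four independent per-speaker decisions. The decoded rows are the frame-level "who spoke when" matrix that a later clustering stage can map to global speaker identities.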
