End-to-End Diarization utilizing Attractor Deep Clustering
David Palzer, Matthew Maciejewski, Eric Fosler-Lussier

TL;DR
This paper introduces a novel end-to-end speaker diarization framework that combines conformer decoders, transformer-updated attractors, and deep clustering techniques to improve speaker separation and robustness in varied conditions.
Contribution
The paper presents a compact, integrated diarization approach that refines speaker representations with a conformer decoder and enforces structured embeddings through a deep-clustering-style angle loss and orthogonality constraints on active attractors.
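To make the orthogonality constraint concrete, here is a minimal sketch of one way such a penalty could be computed. It is not the authors' implementation: the function name `orthogonality_penalty`, the `(S, D)` attractor layout, the boolean `active` mask, and the choice of mean squared off-diagonal cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def orthogonality_penalty(attractors, active, eps=1e-8):
    """Mean squared cosine similarity between distinct active attractors.

    attractors: (S, D) array, one row per attractor (assumed layout).
    active:     (S,) boolean mask, True where an attractor maps to a speaker.
    Returns 0.0 when fewer than two attractors are active.
    """
    a = attractors[active]
    n = a.shape[0]
    if n < 2:
        return 0.0
    # Normalise rows so the Gram matrix holds cosine similarities.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    gram = a @ a.T
    # Keep only off-diagonal entries: similarity between different attractors.
    off_diag = gram[~np.eye(n, dtype=bool)]
    return float(np.mean(off_diag ** 2))

# Hypothetical example: two orthogonal active attractors, one inactive.
attractors = np.array([[1.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0],
                       [5.0, 5.0, 5.0]])
active = np.array([True, True, False])
penalty = orthogonality_penalty(attractors, active)
```

The penalty is zero when the active attractors are mutually orthogonal and grows as they align, so minimising it pushes speakers apart in embedding space. The inactive attractor is excluded by the mask; how the paper suppresses inactive attractors is not specified here.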
Findings
Achieves low diarization error rates in experiments.
Maintains a parameter-efficient model.
Improves speaker separation robustness.
Abstract
Speaker diarization remains challenging due to the need for structured speaker representations, efficient modeling, and robustness to varying conditions. We propose a performant, compact diarization framework that integrates conformer decoders, transformer-updated attractors, and a deep clustering style angle loss. Our approach refines speaker representations with an enhanced conformer structure, incorporating cross-attention to attractors and an additional convolution module. To enforce structured embeddings, we extend deep clustering by constructing label-attractor vectors, aligning their directional structure with audio embeddings. We also impose orthogonality constraints on active attractors for better speaker separation while suppressing non-active attractors to prevent false activations. Finally, a permutation invariant training binary cross-entropy loss refines speaker detection.…
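The permutation invariant training (PIT) binary cross-entropy loss mentioned in the abstract can be illustrated with a small sketch. This is a generic brute-force PIT formulation, not code from the paper: the names `bce` and `pit_bce`, the `(T, S)` frame-by-speaker layout, and the exhaustive search over permutations are assumptions chosen for clarity.

```python
from itertools import permutations

import numpy as np

def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def pit_bce(probs, labels):
    """Return the lowest BCE over all speaker orderings, and that ordering.

    probs, labels: (T, S) arrays (frames x speakers, assumed layout).
    Exhaustive search costs S! evaluations, so it suits small S only.
    """
    n_spk = labels.shape[1]
    best_loss, best_perm = None, None
    for perm in permutations(range(n_spk)):
        # Reorder the reference speaker columns and score this assignment.
        loss = bce(probs, labels[:, list(perm)])
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Hypothetical example: predictions match the labels with speakers swapped.
labels = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
probs = labels[:, ::-1] * 0.9 + 0.05
loss, perm = pit_bce(probs, labels)
```

Because speaker labels have no inherent order, the loss is taken under the label permutation that best matches the model output. In this example the swapped ordering `(1, 0)` is selected, and its loss is lower than the loss under the original ordering.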
Taxonomy
Topics: Advanced Computational Techniques and Applications · Neural Networks and Applications
