X-Vectors with Multi-Scale Aggregation for Speaker Diarization
Myungjong Kim, Vijendra Raj Apsingekar, Divya Neelagiri

TL;DR
This paper introduces a multi-scale aggregation method for x-vector embeddings in speaker diarization. It pools statistics from multiple frame-level layers to improve speaker segmentation accuracy and reports improvements over baseline x-vectors on the CALLHOME dataset.
Contribution
It proposes a novel multi-scale aggregation approach for x-vectors that enhances speaker diarization performance by capturing diverse speaker features from multiple network layers.
Findings
Significant improvement over baseline x-vectors on CALLHOME dataset
Multi-scale aggregation captures richer speaker characteristics
Enhanced speaker representation for short segments
Abstract
Speaker diarization is the process of labeling different speakers in a speech signal. Deep speaker embeddings are generally extracted from short speech segments and clustered to determine which segments belong to the same speaker. The x-vector, which embeds segment-level speaker characteristics by statistically pooling frame-level representations, is one of the most widely used deep speaker embeddings in speaker diarization. Multi-scale aggregation, which employs multi-scale representations from different layers, has recently been used successfully in short-duration speaker verification. In this paper, we investigate a multi-scale aggregation approach in an x-vector embedding framework for speaker diarization by exploiting multiple statistics pooling layers from different frame-level layers. Thus, it is expected that x-vectors with multi-scale aggregation have the potential to…
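The pooling idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `stats_pool` and `multi_scale_aggregate`, the number of layers, and all dimensions are hypothetical, and random arrays stand in for real frame-level network activations. It assumes each statistics pooling layer computes a mean and standard deviation over time and that the pooled vectors are concatenated.

```python
import numpy as np

def stats_pool(frames):
    """Statistics pooling: mean and standard deviation over time, (T, D) -> (2*D,)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def multi_scale_aggregate(layer_outputs):
    """Pool each frame-level layer's output separately, then concatenate.

    Assumption: concatenation is used to combine the per-layer statistics.
    """
    return np.concatenate([stats_pool(h) for h in layer_outputs])

rng = np.random.default_rng(0)

# Hypothetical frame-level activations from three layers for one short segment.
# Frame counts and feature dimensions are illustrative only.
layer_outputs = [
    rng.standard_normal((150, 64)),
    rng.standard_normal((146, 64)),
    rng.standard_normal((140, 128)),
]

pooled = multi_scale_aggregate(layer_outputs)
print(pooled.shape)  # (512,) = 2*64 + 2*64 + 2*128
```

In a full x-vector system, a vector like `pooled` would typically pass through further segment-level layers to produce the embedding, and the embeddings of all segments would then be clustered to assign speaker labels.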
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
