SCDNet: Self-supervised Learning Feature-based Speaker Change Detection
Yue Li, Xinsheng Wang, Li Zhang, Lei Xie

TL;DR
This paper investigates self-supervised learning (SSL) features for speaker change detection (SCD). It proposes SCDNet and a contrastive learning method, and reports that WavLM outperforms the other SSL models examined and that the model design is effective.
Contribution
It introduces SCDNet, analyzes the layers of SSL models with a learnable weighting method, and proposes a contrastive learning method to improve speaker change detection performance.
Findings
WavLM outperforms other SSL models in SCD.
SCDNet effectively leverages SSL features for SCD.
Contrastive learning reduces overfitting in SCD models.
Abstract
Speaker Change Detection (SCD) aims to identify boundaries among speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including HuBERT, wav2vec 2.0, and WavLM, are investigated. To discern the most potent layer of SSL models for SCD, a learnable weighting method is employed to analyze the effectiveness of intermediate representations. Additionally, a fine-tuning-based approach is implemented to further compare the characteristics of SSL models in the SCD task. Furthermore, a contrastive learning method is proposed to mitigate the overfitting tendencies in the training of both the fine-tuning-based method and SCDNet.…
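The learnable weighting over SSL layers mentioned in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the softmax normalization of the per-layer scores, the function names (`softmax`, `weighted_layer_sum`), and the toy feature values are all assumptions made for illustration. In a real model the scores would be trainable parameters updated by backpropagation, and the learned weights would then indicate which layers contribute most.

```python
import math


def softmax(scores):
    """Turn unconstrained per-layer scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_scores):
    """Fuse per-layer features into one sequence of frame features.

    layer_features: list over layers, each a list of frames, each a list of floats.
    layer_scores:   one scalar per layer (assumed learnable in a real model).
    Returns (fused_features, layer_weights).
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    n_frames = len(layer_features[0])
    n_dims = len(layer_features[0][0])
    fused = [[0.0] * n_dims for _ in range(n_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(n_frames):
            for d in range(n_dims):
                fused[t][d] += weight * layer[t][d]
    return fused, weights


# Toy input (hypothetical values): 3 layers, 2 frames, 2 feature dimensions.
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

# Equal scores give equal weights, so the fused output is the plain layer average.
fused_equal, weights_equal = weighted_layer_sum(toy_layers, [0.0, 0.0, 0.0])

# A much larger score on the last layer makes the fusion follow that layer.
fused_last, weights_last = weighted_layer_sum(toy_layers, [0.0, 0.0, 50.0])
```

After training, inspecting the weights returned alongside the fused features is one way such a scheme could be used to compare the contribution of intermediate representations.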
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Contrastive Learning
