USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models
Guanlong Zhao, Yongqiang Wang, Jason Pelecanos, Yu Zhang, Hank Liao, Yiling Huang, Han Lu, Quan Wang

TL;DR
This paper presents USM-SCD, a multilingual speaker change detection and ASR model based on large pretrained foundation models, achieving high accuracy across 96 languages and outperforming previous monolingual baselines.
Contribution
The paper introduces a novel multilingual speaker change detection model adapted from a large foundation model, demonstrating effective fine-tuning and state-of-the-art performance across multiple languages.
Findings
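Achieves an average speaker change detection F1 score above 75% on a test set covering 96 languages.
Attains an 85.8% speaker change detection F1 score on American English across public and internal test sets, beating the previous monolingual baseline by 21% relative.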
Fine-tuning only 25% of the trainable parameters is enough to reach the best performance.
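The findings report F1 scores and a relative (not absolute) improvement. The sketch below shows how these two quantities are typically computed. The function names and all numbers are hypothetical illustrations, not values or code from the paper, and the paper's exact matching rules for counting correct speaker changes are not reproduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


# Hypothetical precision/recall values, chosen only to illustrate the arithmetic.
baseline_f1 = f1_score(precision=0.50, recall=0.50)
new_f1 = f1_score(precision=0.60, recall=0.60)

# An absolute gain of 10 points here corresponds to a 20% relative gain.
gain = relative_improvement(new_f1, baseline_f1)
print(f"baseline={baseline_f1:.3f} new={new_f1:.3f} relative gain={gain:.1%}")
```

A relative figure divides the gain by the baseline score, so it is always larger than the absolute point difference whenever the baseline is below 100%.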
Abstract
We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
