USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

Guanlong Zhao; Yongqiang Wang; Jason Pelecanos; Yu Zhang; Hank Liao; Yiling Huang; Han Lu; Quan Wang

arXiv:2309.08023·eess.AS·January 9, 2024

Open Access

TL;DR

This paper presents USM-SCD, a multilingual model adapted from a large pretrained speech foundation model that detects speaker changes and performs ASR simultaneously. It averages more than 75% speaker change detection F1 across 96 languages and beats the previous monolingual baseline on American English by 21% relative.

Contribution

The paper introduces a multilingual speaker change detection model adapted from a large speech foundation model, and shows through ablation studies that fine-tuning a generic foundation model is effective for this downstream task across 96 languages.

Findings

01

Achieves more than 75% average speaker change detection F1 score on a test set covering 96 languages.

02

Attains an 85.8% speaker change detection F1 score on American English across public and internal test sets, beating the previous monolingual baseline by 21% relative.

03

Fine-tuning only 25% of the trainable parameters is enough to reach the best performance.

Abstract

We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune…
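The abstract reports the American English result as a relative improvement, which is easy to misread as an absolute gain in F1 points. The sketch below works through the arithmetic, assuming "21% relative" means (new - baseline) / baseline = 0.21. The function name `implied_baseline` is hypothetical, and the baseline value it prints is derived only from the two figures quoted above; it is not a number stated in this excerpt.

```python
def implied_baseline(new_score: float, relative_gain: float) -> float:
    """Return the baseline b such that new_score = b * (1 + relative_gain)."""
    if relative_gain <= -1.0:
        raise ValueError("relative_gain must be greater than -1")
    return new_score / (1.0 + relative_gain)


# Figures quoted in the abstract: 85.8% F1 and a 21% relative improvement.
usm_scd_f1 = 85.8
relative_gain = 0.21

# Assumption: the relative gain is measured against the baseline F1 score.
baseline_f1 = implied_baseline(usm_scd_f1, relative_gain)
absolute_gain = usm_scd_f1 - baseline_f1

print(f"implied baseline F1: {baseline_f1:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} points")
```

Under that assumption the implied baseline is roughly 70.9% F1, so the improvement is about 15 absolute points rather than 21.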

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
