MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Thai-Binh Nguyen; Alexander Waibel

arXiv:2411.18152·cs.CL·January 16, 2025

TL;DR

This paper presents a method for multilingual speaker attribution in automatic speech recognition. The multilingual ASR model stays frozen, and a separate speaker module is trained on standard monolingual ASR data to predict speaker embeddings. The approach is reported to be competitive with existing methods.

Contribution

It introduces a speaker attribution approach that pairs a frozen multilingual ASR model with a speaker module trained on weak labels to predict speaker embeddings, without modifying the ASR system.

Findings

01. Effective speaker attribution across multilingual datasets.

02. Competitive performance with existing methods.

03. Robustness to overlapping speech scenarios.

Abstract

Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate…
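
To make the idea of attributing transcripts through predicted speaker embeddings more concrete, here is a minimal sketch in Python. It is not the paper's procedure: the abstract only states that a speaker module predicts speaker embeddings while the ASR model stays frozen. The step shown here, turning per-word embeddings into speaker labels with a greedy cosine-similarity clustering, is an assumption made for illustration, as are the names `cosine`, `attribute_speakers`, and `threshold` and the toy two-dimensional embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_speakers(
    words: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.7,  # hypothetical similarity cut-off, not from the paper
) -> List[Tuple[str, int]]:
    """Assign a speaker index to each word by greedy online clustering.

    `words` stands in for the output of a frozen ASR model and `embeddings`
    for the output of a speaker module; both are assumed inputs here.
    """
    if len(words) != len(embeddings):
        raise ValueError("need exactly one embedding per word")

    centroids: List[List[float]] = []  # running sum of embeddings per speaker
    labeled: List[Tuple[str, int]] = []
    for word, emb in zip(words, embeddings):
        scores = [cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No sufficiently similar speaker so far: open a new cluster.
            centroids.append(list(emb))
            best = len(centroids) - 1
        else:
            # Fold the embedding into the matched speaker's running sum.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
        labeled.append((word, best))
    return labeled


if __name__ == "__main__":
    toy_words = ["hello", "there", "hi", "back"]
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for word, speaker in attribute_speakers(toy_words, toy_embeddings):
        print(f"speaker {speaker}: {word}")
```

In this sketch the transcript itself is never touched, which mirrors the stated design goal of leaving the ASR system unmodified: speaker labels are derived only from the embeddings attached to the words.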

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems