Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Xianrui Zheng; Chao Zhang; Philip C. Woodland

arXiv:2207.03852 · eess.AS · July 11, 2022

TL;DR

This paper introduces a tandem multitask training method for fine-tuning Wav2Vec 2.0 to perform speaker diarisation and speech recognition simultaneously, improving accuracy and efficiency in meeting transcription.

Contribution

The paper proposes a multitask training framework that uses different Wav2Vec 2.0 (W2V2) layers for voice activity detection (VAD), speaker classification (SC), and automatic speech recognition (ASR), enabling joint optimisation of all three tasks for meeting transcription.
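As a rough illustration of reading task features from different depths of one shared encoder, the sketch below runs a toy layer stack once and collects the hidden states at three tap points. Everything in it is an assumption made for illustration: the 12-layer toy encoder, the tap indices in `TAPS`, and the helper names `make_layer` and `encode_with_taps` are hypothetical and are not taken from the paper.

```python
import math
import random

# Hypothetical tap points for a 12-layer toy encoder (assumed, not from the paper).
TAPS = {"vad": 4, "sc": 8, "asr": 12}


def make_layer(dim, rng):
    """Toy stand-in for one encoder layer: fixed random linear map plus tanh."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(frames):
        return [
            [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
            for frame in frames
        ]

    return layer


def encode_with_taps(layers, frames, taps):
    """Run the stack once and collect the hidden states each task reads from."""
    deepest = max(taps.values())  # layers beyond the deepest tap are never run
    tapped = {}
    hidden = frames
    for index, layer in enumerate(layers[:deepest], start=1):
        hidden = layer(hidden)
        for task, tap in taps.items():
            if tap == index:
                tapped[task] = hidden
    return tapped


rng = random.Random(0)
encoder = [make_layer(8, rng) for _ in range(12)]
frames = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]

# One forward pass serves all three tasks.
features = encode_with_taps(encoder, frames, TAPS)
# A VAD-only pass stops after the early tap and skips the remaining layers.
vad_only = encode_with_taps(encoder, frames, {"vad": TAPS["vad"]})
```

In this toy setup the VAD-only pass evaluates 4 of the 12 layers, which is one way to picture how tapping earlier layers for earlier tasks can save computation.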

Findings

01

Reduces the diarisation error rate (DER) by up to 17% relative.

02

Joint fine-tuning of VAD, SC, and ASR achieves 16-17% relative reductions in DER.

03

Decreases computational cost by using different layers for each task.

Abstract

Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification (SC) are required, and connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early layer, middle layer, and late layer of W2V2, which coincides with the order of segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the augmented multi-party (AMI) dataset showed that using different W2V2 layers for VAD, SC, and ASR from the earlier to…
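The processing order described in the abstract (segment with VAD, cluster segments by speaker embedding, transcribe each segment) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the 0.5 threshold, the function names `vad_segments` and `transcribe_meeting`, and the `embed`, `cluster`, and `recognise` callables are all hypothetical placeholders.

```python
def vad_segments(speech_probs, threshold=0.5):
    """Turn per-frame speech probabilities into (start_frame, end_frame) pairs.

    end_frame is exclusive. The threshold value is an assumed default.
    """
    segments = []
    start = None
    for i, prob in enumerate(speech_probs):
        if prob >= threshold and start is None:
            start = i  # speech begins
        elif prob < threshold and start is not None:
            segments.append((start, i))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(speech_probs)))  # speech runs to the end
    return segments


def transcribe_meeting(speech_probs, embed, cluster, recognise):
    """Tandem order: VAD segmentation, speaker clustering, then per-segment ASR.

    embed, cluster, and recognise are hypothetical callables supplied by the caller.
    """
    segments = vad_segments(speech_probs)
    speakers = cluster([embed(segment) for segment in segments])
    return [
        (segment, speaker, recognise(segment))
        for segment, speaker in zip(segments, speakers)
    ]


# Usage with stub components standing in for real models.
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9]
result = transcribe_meeting(
    probs,
    embed=lambda segment: segment[1] - segment[0],
    cluster=lambda embeddings: list(range(len(embeddings))),
    recognise=lambda segment: "<text>",
)
```

The stubs only demonstrate the data flow; in a real system each callable would be backed by the corresponding model output.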

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing