Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription
Xianrui Zheng, Chao Zhang, Philip C. Woodland

TL;DR
This paper introduces a tandem multitask training method for fine-tuning Wav2Vec 2.0 to perform speaker diarisation and speech recognition simultaneously, improving accuracy and efficiency in meeting transcription.
Contribution
The paper proposes a multitask training framework that uses different Wav2Vec 2.0 (W2V2) layers for voice activity detection (VAD), speaker classification (SC), and automatic speech recognition (ASR), enabling joint optimization of a single model for meeting transcription.
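The layer-sharing idea can be illustrated with a toy model. The sketch below is not the authors' implementation: `toy_layer` stands in for a Transformer block, and the 12-layer depth and tap positions (4, 8, 12) are hypothetical values chosen only to show an early, middle, and late tap. It shows why tapping an early layer saves computation: a task only needs the shared stack to run as deep as its own tap.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Frames = List[List[float]]  # one feature vector per audio frame


def toy_layer(frames: Frames) -> Frames:
    """Stand-in for one Transformer block (assumption: a simple affine map)."""
    return [[0.5 * x + 1.0 for x in frame] for frame in frames]


@dataclass
class TandemModel:
    num_layers: int
    taps: Dict[str, int]  # task name -> 1-based layer whose output feeds that head

    def __post_init__(self) -> None:
        for task, layer in self.taps.items():
            if not 1 <= layer <= self.num_layers:
                raise ValueError(f"tap for {task!r} is outside the layer stack")

    def forward(self, frames: Frames, tasks: List[str]) -> Tuple[Dict[str, Frames], int]:
        """Run the shared stack only as deep as the requested tasks need."""
        depth = max(self.taps[task] for task in tasks)
        outputs: Dict[str, Frames] = {}
        hidden = frames
        for layer_idx in range(1, depth + 1):
            hidden = toy_layer(hidden)
            for task in tasks:
                if self.taps[task] == layer_idx:
                    outputs[task] = hidden  # features handed to that task's head
        return outputs, depth


# Hypothetical tap positions: early for VAD, middle for SC, late for ASR.
model = TandemModel(num_layers=12, taps={"vad": 4, "sc": 8, "asr": 12})
frames = [[0.0, 2.0], [4.0, 6.0]]

vad_only, vad_depth = model.forward(frames, ["vad"])
all_out, full_depth = model.forward(frames, ["vad", "sc", "asr"])
print("layers run for VAD only:", vad_depth)
print("layers run for all tasks:", full_depth)
```

In this toy setup, a VAD-only pass stops after the early tap, while a full pass reuses the same early and middle activations for VAD and SC on the way to the ASR tap.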
Findings
Joint fine-tuning of VAD, SC, and ASR achieves a 16-17% relative reduction in diarisation error rate (DER).
Decreases computational cost by using different layers for each task.
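For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time, and a relative reduction compares two DERs as a ratio. A minimal sketch of that arithmetic follows; the durations are made-up illustrative numbers, not results from the paper.

```python
def diarisation_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Conventional DER: total error time over total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical durations in seconds (illustrative only, not from the paper).
baseline_der = diarisation_error_rate(30.0, 20.0, 50.0, 1000.0)
joint_der = diarisation_error_rate(25.0, 18.0, 40.0, 1000.0)

print(f"baseline DER = {baseline_der:.3f}")
print(f"joint DER    = {joint_der:.3f}")
print(f"relative reduction = {relative_reduction(baseline_der, joint_der):.1%}")
```

Note that a relative reduction of this size corresponds to a much smaller absolute change in DER, which is why the two should not be confused when reading the findings.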
Abstract
Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification (SC) are required, and connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early layer, middle layer, and late layer of W2V2, which coincides with the order of segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the augmented multi-party (AMI) dataset showed that using different W2V2 layers for VAD, SC, and ASR from the earlier to…
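The inference order described in the abstract (segment with VAD, then transcribe each segment with a CTC-based recogniser) can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the 0.5 threshold, the `_` blank symbol, and the per-frame probabilities and labels are all hypothetical, and the speaker clustering step between VAD and ASR is omitted for brevity.

```python
from itertools import groupby
from typing import List, Tuple


def vad_segments(speech_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn frame-level speech probabilities into [start, end) frame segments."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(speech_probs):
        if p >= threshold and start is None:
            start = i                      # speech begins
        elif p < threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # speech runs to the final frame
        segments.append((start, len(speech_probs)))
    return segments


def ctc_greedy_decode(frame_labels: List[str], blank: str = "_") -> str:
    """Greedy CTC decoding: merge repeated labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# Hypothetical per-frame outputs of the VAD head and the ASR head.
probs = [0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.9]
labels = ["_", "h", "h", "_", "_", "_", "i", "i"]

segments = vad_segments(probs)
transcripts = [ctc_greedy_decode(labels[s:e]) for s, e in segments]
for segment, text in zip(segments, transcripts):
    print(segment, text)
```

In a full system, each segment would also be assigned a speaker label by clustering speaker embeddings before its transcript is attributed.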
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
