WavFT: Acoustic model finetuning with labelled and unlabelled data
Utkarsh Chauhan, Vikas Joshi, Rupesh R. Mehta

TL;DR
This paper introduces a novel acoustic model finetuning method that leverages both labelled and unlabelled data during the finetuning stage, reducing the need for large-scale pretraining and improving speech recognition accuracy.
Contribution
The paper proposes a joint training approach combining classification and contrastive losses for acoustic model finetuning with labelled and unlabelled data, outperforming traditional methods.
Findings
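Achieved 11.2% word error rate relative (WERR) reduction on Gujarati over conventional finetuning
Achieved 9.19% word error rate relative (WERR) reduction on Bengali over conventional finetuning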
Effective use of unlabelled data during finetuning
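The WERR figures above are relative, not absolute, improvements. A minimal sketch of the arithmetic, assuming the usual definition of relative reduction (baseline WER minus new WER, divided by baseline WER); the function name and the WER values in the example are hypothetical and do not come from the paper:

```python
def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
    """Relative word error rate reduction, in percent.

    Assumed definition: 100 * (baseline - new) / baseline.
    """
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


# Hypothetical numbers for illustration only: a drop from 20.0% to 18.0% WER
# is a 2-point absolute gain but a 10% relative reduction.
example = relative_wer_reduction(20.0, 18.0)
```

Under this definition, the same relative figure corresponds to a smaller absolute gain when the baseline WER is lower.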
Abstract
Unsupervised and self-supervised learning methods have leveraged unlabelled data to improve pretrained models. However, these methods need a significantly large amount of unlabelled data, and the computational cost of training models with such a large amount of data can be prohibitively high. We address this issue by using unlabelled data during finetuning instead of pretraining. We propose acoustic model finetuning (FT) using labelled and unlabelled data. The model is jointly trained to learn representations to classify senones, as well as to learn contextual acoustic representations. Our training objective is a combination of cross entropy loss, suitable for the classification task, and contrastive loss, suitable for learning acoustic representations. The proposed approach outperforms conventional finetuning with 11.2% and 9.19% word error rate relative (WERR) reduction on Gujarati and Bengali…
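The joint objective described in the abstract can be sketched as a weighted sum of a cross entropy term on labelled frames and a contrastive term on unlabelled frames. The sketch below is an illustration under stated assumptions, not the authors' implementation: the contrastive term is a generic InfoNCE loss with in-batch negatives, and the weight `alpha`, the `temperature`, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross entropy over labelled frames.

    logits: (N, C) senone scores; labels: (N,) integer senone ids.
    """
    logp = log_softmax(logits, axis=1)
    return -np.mean(logp[np.arange(len(labels)), labels])


def info_nce(context, targets, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact form).

    context: (M, D) contextual representations; targets: (M, D) target
    representations. For frame i, targets[i] is the positive and the
    other rows of targets act as negatives.
    """
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sim = (c @ t.T) / temperature          # (M, M) cosine similarities
    logp = log_softmax(sim, axis=1)
    return -np.mean(np.diag(logp))


def joint_loss(logits, labels, context, targets, alpha=0.5):
    """Cross entropy plus alpha times contrastive loss (alpha is hypothetical)."""
    return cross_entropy(logits, labels) + alpha * info_nce(context, targets)
```

In this sketch, labelled batches contribute through `logits` and `labels`, while `context` and `targets` can come from unlabelled audio, since the contrastive term needs no transcripts.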
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
