Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
Han Zhu, Li Wang, Jindong Wang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

TL;DR
Wav2vec-S introduces a task-specific semi-supervised pre-training method that refines self-supervised models for low-resource ASR, significantly improving performance with minimal additional pre-training time.
Contribution
The paper proposes wav2vec-S, a semi-supervised pre-training approach that enhances self-supervised models specifically for low-resource ASR tasks, outperforming previous methods.
Findings
Significant word error rate (WER) reductions on various datasets.
Minimal increase in pre-training time.
Semi-supervised pre-training closes representation gaps.
Abstract
Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training methods are task-agnostic, i.e., they can be applied to various downstream tasks. Although this enlarges their scope of application, the capacity of the pre-trained model is not fully utilized for the ASR task, and the learned representations may not be optimal for ASR. In this work, in order to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, where we use task-specific semi-supervised pre-training to refine the self-supervised pre-trained model for the ASR task, thus more effectively utilizing the capacity of the pre-trained model to generate task-specific representations for ASR. Experiments show that compared to wav2vec 2.0, wav2vec-S only requires a marginal increment of…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
