Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models
Haoyu Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan

TL;DR
This paper proposes a zero-resource speech recognition approach for multiple languages: self-supervised pre-trained models are fine-tuned on IPA phoneme transcriptions and decoded with a language model trained on extra text. It reports word error rates below 20% on some languages and an average of 33.77% across 8 languages.
Contribution
To the authors' knowledge, this is the first attempt to extend self-supervised pre-trained models to word-level zero-resource speech recognition, where the error rates of earlier methods were usually too high for real-world use.
Findings
Achieved a word error rate (WER) below 20% on some languages
Achieved an average WER of 33.77% across 8 languages
Demonstrated the effectiveness of fine-tuning pre-trained models for zero-resource tasks
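For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function names `word_error_rate` and `average_wer` are illustrative, not taken from the paper, and the unweighted mean over languages is an assumption; the summary does not say how the 33.77% average was aggregated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_language: dict) -> float:
    """Unweighted mean of per-language WERs (an assumed aggregation)."""
    return sum(per_language.values()) / len(per_language)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.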
Abstract
Labeled audio data is insufficient to build satisfactory speech recognition systems for most of the languages in the world. There have been some zero-resource methods that try to perform phoneme or word-level speech recognition without labeled audio data of the target language, but the error rate of these methods is usually too high for them to be applied in real-world scenarios. Recently, the representation ability of self-supervised pre-trained models has been found to be extremely beneficial in zero-resource phoneme recognition. To the best of our knowledge, this paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
