Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

Haoyu Wang; Wei-Qiang Zhang; Hongbin Suo; Yulong Wan

arXiv:2210.06936 · cs.CL · January 14, 2025

TL;DR

This paper proposes a multilingual zero-resource speech recognition method: self-supervised pre-trained acoustic models are fine-tuned on IPA phoneme transcriptions, and word-level output is obtained by decoding with a language model trained on extra text. The reported WER is below 20% on some languages and averages 33.77% across 8 languages.

Contribution

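To the authors' knowledge, this is the first attempt to extend self-supervised pre-trained models from zero-resource phoneme recognition to word-level zero-resource speech recognition, where earlier methods had error rates too high for real-world use.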

Findings

01. Achieved less than 20% WER on some languages

02. Average WER across 8 languages is 33.77%

03. Demonstrated the effectiveness of fine-tuning pre-trained models for zero-resource tasks
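
The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; this page does not say which scoring tool or text normalization the authors used, nor how the 8-language average was aggregated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.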

Abstract

Labeled audio data is insufficient to build satisfactory speech recognition systems for most of the world's languages. Some zero-resource methods attempt phoneme-level or word-level speech recognition without labeled audio data from the target language, but their error rates are usually too high for real-world use. Recently, the representation ability of self-supervised pre-trained models has been found to be extremely beneficial in zero-resource phoneme recognition. To the best of our knowledge, this paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less…
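
To make the two stages in the abstract concrete (a phoneme recognizer, then decoding to words with a language model), here is a toy sketch. It is not the authors' decoder, which is not described in the text available here. It assumes CTC-style frame labels from the acoustic model, a hypothetical pronunciation lexicon mapping IPA phoneme tuples to words (one word per pronunciation), and a unigram language model; the names `ctc_greedy_collapse` and `segment_into_words` and all example data are made up for illustration.

```python
import math


def ctc_greedy_collapse(frame_labels, blank="<blank>"):
    """Standard CTC best-path rule: merge repeated labels, then drop blanks."""
    phonemes = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            phonemes.append(label)
        previous = label
    return phonemes


def segment_into_words(phonemes, lexicon, unigram_logprob):
    """Viterbi segmentation of a phoneme sequence into lexicon words.

    lexicon: dict mapping a tuple of phonemes to a word (toy assumption).
    unigram_logprob: dict mapping each word to its log-probability.
    Returns the highest-scoring word list, or None if no segmentation exists.
    """
    n = len(phonemes)
    if n == 0:
        return []
    max_len = max((len(pron) for pron in lexicon), default=0)
    best = [-math.inf] * (n + 1)  # best[i]: best score covering phonemes[:i]
    back = [None] * (n + 1)       # back[i]: (start index, word) ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            word = lexicon.get(tuple(phonemes[start:end]))
            if word is None:
                continue
            score = best[start] + unigram_logprob[word]
            if score > best[end]:
                best[end] = score
                back[end] = (start, word)
    if back[n] is None:
        return None
    words = []
    position = n
    while position > 0:
        start, word = back[position]
        words.append(word)
        position = start
    return words[::-1]


if __name__ == "__main__":
    # Hypothetical frame-level output of a phoneme recognizer.
    frames = ["<blank>", "k", "k", "<blank>", "æ", "æ", "t",
              "<blank>", "s", "s", "æ", "t"]
    lexicon = {("k", "æ", "t"): "cat", ("s", "æ", "t"): "sat", ("æ", "t"): "at"}
    logprob = {"cat": math.log(0.4), "sat": math.log(0.3), "at": math.log(0.3)}
    phonemes = ctc_greedy_collapse(frames)
    print(phonemes, segment_into_words(phonemes, lexicon, logprob))
```

The sketch only works when the phoneme sequence is error-free and fully covered by the lexicon. A practical decoder would search over acoustic scores jointly with a stronger language model rather than committing to a single best phoneme path first.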

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing