Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Saierdaer Yusuyin; Te Ma; Hao Huang; Wenbo Zhao; Zhijian Ou

arXiv:2406.02166 · cs.SD · March 28, 2025

TL;DR

Whistle introduces a phoneme-based pretraining approach for multilingual speech recognition that leverages weakly phonetic supervision, improving data efficiency and crosslingual performance, especially with limited training data.

Contribution

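This paper shows that weakly phonetic supervision is effective for multilingual speech recognition: IPA transcripts are generated by G2P models instead of being validated by humans. The authors argue that phonetic supervision has been underappreciated relative to graphemic supervision and self-supervised pretraining.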

Findings

01. Phoneme-based models outperform subword and self-supervised models in low-data scenarios.

02. Whistle improves crosslingual recognition for unseen languages.

03. The approach enhances training efficiency and mitigates catastrophic forgetting.

Abstract

There are three approaches to multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for sharing information between languages. This paper explores pretraining with weakly phonetic supervision for data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard, human-validated phonetic transcripts and instead obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of…
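To make the idea of weak phonetic labels concrete, the following is a minimal sketch of turning a text transcript into an IPA phoneme sequence without human validation. It is an illustration only: `TOY_LEXICON`, `weak_phonetic_labels`, and the `<unk>` placeholder are hypothetical names invented here, and the lookup table stands in for a real G2P model. It is not the LanguageNet G2P pipeline used in the paper.

```python
# Toy illustration of weak phonetic supervision.
# ASSUMPTION: this tiny hand-written lexicon replaces a real G2P model;
# it is not the LanguageNet G2P system described in the abstract.
TOY_LEXICON = {
    "cat": ["k", "æ", "t"],
    "ship": ["ʃ", "ɪ", "p"],
}


def weak_phonetic_labels(transcript, lexicon, unk="<unk>"):
    """Map a text transcript to a flat list of IPA symbols.

    Words missing from the lexicon become a single `unk` token. The
    output is "weak" in the sense that nobody checks it against the audio.
    """
    phones = []
    for word in transcript.lower().split():
        phones.extend(lexicon.get(word, [unk]))
    return phones


labels = weak_phonetic_labels("Cat ship", TOY_LEXICON)
print(labels)
```

Because the labels are IPA symbols rather than language-specific graphemes, the same symbol inventory can in principle be shared across languages, which is the information-sharing advantage the abstract refers to.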

Code & Models

Repositories

thu-spmi/cat (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Sparse Evolutionary Training