ASR2K: Speech Recognition for Around 2000 Languages without Audio
Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

TL;DR
This paper introduces a speech recognition pipeline for nearly 2000 languages that requires no audio from the target language, only raw text or n-gram statistics, which makes recognition feasible for low-resource languages.
Contribution
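It presents a speech recognition pipeline that needs no target-language audio: the acoustic and pronunciation models are multilingual and used without supervision, while the language model is built from n-gram statistics or raw text. Combined with the Crubadan n-gram database, the approach covers 1909 languages.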
Findings
Achieved 50% character error rate (CER) and 74% word error rate (WER) on the CMU Wilderness dataset using Crubadan statistics.
Improved to 45% CER and 69% WER using 10,000 raw text utterances.
Built recognition systems for 1909 languages using this approach.
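The error rates above are edit-distance metrics. A minimal sketch of how CER and WER are conventionally computed is given here; the function names are illustrative, and the text normalization used in the paper is not stated on this page, so this version assumes that spaces count as characters and that words are split on whitespace.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, a 50% CER means the number of character edits needed to turn the hypothesis into the reference equals half the reference length. WER is usually higher than CER because a single wrong character makes the whole word count as an error.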
Abstract
Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and CMU Wilderness dataset. We achieve 50% CER and 74% WER on the…
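To illustrate the language model component, the sketch below builds a character n-gram model from raw text or from a precomputed table of n-gram counts. It is not the paper's implementation: the class name `CharNgramLM`, the helper `count_ngrams`, the `^` and `$` padding symbols, and the add-one smoothing are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def count_ngrams(lines, n=3):
    """Count character n-grams in raw text, padding each line with ^ and $."""
    counts = Counter()
    for line in lines:
        padded = "^" * (n - 1) + line.strip() + "$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


class CharNgramLM:
    """Character n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, ngram_counts, n=3):
        self.n = n
        self.ngrams = Counter(ngram_counts)
        self.contexts = Counter()
        vocab = set()
        for gram, count in self.ngrams.items():
            self.contexts[gram[:-1]] += count  # count of the (n-1)-char history
            vocab.add(gram[-1])
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen symbols

    def log_prob(self, text):
        """Natural-log probability of a string under the model."""
        padded = "^" * (self.n - 1) + text + "$"
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total
```

Because the constructor only needs a mapping from n-gram to count, the same class can be fed either counts derived from raw text with `count_ngrams` or a ready-made count table, which mirrors the two data sources the abstract mentions. A decoder could then use `log_prob` to prefer hypotheses that look like the target language.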
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Test
