Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

TL;DR
Maestro-U introduces a joint speech-text learning approach enabling zero-shot multilingual speech recognition for 102 languages, significantly reducing error rates without requiring transcribed speech for many languages.
Contribution
This work presents a joint speech-text representation learning method that enables ASR for languages with no supervised (transcribed) speech, expanding multilingual ASR coverage without extensive labeled data.
Findings
Reduced character error rate (CER) from 64.8% to 30.8% on zero-shot languages.
Achieved 68.5% relative gap closure to oracle performance.
Lowered CER below 15% for 19 languages.
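The arithmetic behind these figures can be checked with a minimal sketch. It assumes that "relative gap closure" means (baseline CER - new CER) / (baseline CER - oracle CER); that definition, the function names, and the oracle CER it implies are illustrative assumptions and are not stated in this summary.

```python
# Sketch only: the gap-closure definition below is an assumption,
# and the derived oracle CER is implied, not a reported result.

def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


def implied_oracle(baseline: float, improved: float, gap_closure: float) -> float:
    """Oracle error implied by gap_closure = (baseline - improved) / (baseline - oracle)."""
    return baseline - (baseline - improved) / gap_closure


baseline_cer = 64.8   # CER (%) on zero-shot languages before, from the findings
improved_cer = 30.8   # CER (%) on zero-shot languages after, from the findings
gap_closure = 0.685   # 68.5% relative gap closure, from the findings

rel = relative_reduction(baseline_cer, improved_cer)
oracle = implied_oracle(baseline_cer, improved_cer, gap_closure)

print(f"Relative CER reduction: {100 * rel:.1f}%")
print(f"Oracle CER implied under the assumed definition: {oracle:.1f}%")
```

Under this assumed definition, the drop from 64.8% to 30.8% is a relative CER reduction of roughly 52%, and the 68.5% figure would correspond to an oracle CER of roughly 15%. Treat the latter as a consistency check, not a number taken from the paper.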
Abstract
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in only a subset of these languages and can be used to improve end-to-end ASR quality on the remaining ones. First, we show that by combining speech representations with byte-level text…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
