Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Zhehuai Chen; Ankur Bapna; Andrew Rosenberg; Yu Zhang; Bhuvana Ramabhadran; Pedro Moreno; Nanxin Chen

arXiv:2210.10027·cs.CL·October 24, 2022

Open Access

TL;DR

Maestro-U uses joint speech-text representation learning to extend multilingual speech recognition across the 102 FLEURS languages, including 50 languages with no transcribed speech, and substantially reduces character error rates on those languages.

Contribution

This work shows that a modality-matched joint speech-text model can train a massively multilingual ASR model with no transcribed speech for some languages, using only unlabeled speech and text in those languages.

Findings

01

Reduced Character Error Rate (CER) from 64.8% to 30.8% on languages with no supervised speech.

02

Closed the gap to oracle performance by 68.5% relative.

03

Lowered CER below 15% for 19 languages.
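
The figures above are relative measures, and a minimal Python sketch can make the arithmetic explicit. The function names `relative_reduction` and `gap_closure` are illustrative, not from the paper. The oracle CER is not stated in this summary, so the gap-closure call uses clearly hypothetical numbers, and the formula is the usual definition rather than one confirmed by the source.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


def gap_closure(baseline: float, system: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap that the system closes.

    Assumed definition: (baseline - system) / (baseline - oracle).
    """
    return (baseline - system) / (baseline - oracle)


# CER values reported in the findings (percent).
print(f"{relative_reduction(64.8, 30.8):.1%}")  # 52.5%

# Hypothetical CERs, only to show how a gap closure is computed.
print(f"{gap_closure(baseline=60.0, system=30.0, oracle=20.0):.1%}")  # 75.0%
```

Under this reading, the drop from 64.8% to 30.8% removes roughly half of the character errors on the languages without supervised speech.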

Abstract

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text…
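
As a general illustration of what "byte-level text" can mean, the sketch below maps text in any script to UTF-8 byte values, so every language shares one vocabulary of at most 256 symbols. This is an assumption about the general technique, not the paper's actual tokenizer; the helper names `to_byte_ids` and `from_byte_ids` are hypothetical.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids by decoding the bytes back to text."""
    return bytes(ids).decode("utf-8")


latin = "ASR"
hindi = "\u092d\u093e\u0937\u093e"    # "language" written in Devanagari
bengali = "\u09ad\u09be\u09b7\u09be"  # "language" written in Bengali script

for text in (latin, hindi, bengali):
    ids = to_byte_ids(text)
    # Every id is below 256 regardless of the script.
    assert max(ids) < 256
    # The mapping is lossless.
    assert from_byte_ids(ids) == text
    print(len(text), "characters ->", len(ids), "byte ids")
```

The two non-Latin words share no characters, yet both are expressed with the same small set of byte symbols, which is one reason a byte-level vocabulary is convenient when languages have little graphemic overlap.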

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling