Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Yu Zhang; Wei Han; James Qin; Yongqiang Wang; Ankur Bapna; Zhehuai Chen; Nanxin Chen; Bo Li; Vera Axelrod; Gary Wang; Zhong Meng; Ke Hu; Andrew Rosenberg; Rohit Prabhavalkar; Daniel S. Park; Parisa Haghani; Jason Riesa; Ginger Perng; Hagen Soltau; Trevor Strohman; Bhuvana Ramabhadran; Tara Sainath; Pedro Moreno; Chung-Cheng Chiu; Johan Schalkwyk; Françoise Beaufays; Yonghui Wu

arXiv:2303.01037·cs.CL·September 26, 2023·112 cites

Open Access 2 Models

TL;DR

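The paper presents USM, a universal speech recognition model whose encoder is pre-trained on 12 million hours of unlabeled multilingual audio and then fine-tuned on a smaller labeled dataset. It reports state-of-the-art results on multilingual ASR across 100+ languages while using a labeled training set one-seventh the size of the one used for Whisper.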

Contribution

It combines multilingual pre-training based on random-projection quantization with speech-text modality matching to scale ASR coverage to 100+ languages.

Findings

01

USM matches or outperforms Whisper on many languages, despite using a labeled training set one-seventh the size.

02

Comparable or better performance relative to Whisper holds on both in-domain and out-of-domain speech recognition tasks.

03

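Pre-training the encoder on 12 million hours of unlabeled audio spanning over 300 languages, followed by fine-tuning on a smaller labeled dataset, enables ASR across 100+ languages.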

Abstract

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
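The abstract names random-projection quantization without describing it. As a rough illustration only, the sketch below assumes the commonly described form of the idea: speech features are multiplied by a frozen random matrix, and the index of the nearest vector in a frozen random codebook becomes a discrete label that the encoder can be trained to predict for masked frames. The function names, feature dimension, code dimension, and codebook size are all hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim, codebook_size, seed=0):
    """Create a frozen random projection and a frozen random codebook.

    Neither array is ever trained; all sizes here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(feat_dim, code_dim))
    codebook = rng.normal(size=(codebook_size, code_dim))
    # Unit-normalize the code vectors so nearest-neighbor search reduces to cosine similarity.
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook


def quantize(features, projection, codebook):
    """Map each frame (row of `features`) to the index of its nearest code vector."""
    projected = features @ projection
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    projected = projected / np.maximum(norms, 1e-8)
    # For unit vectors, the highest cosine similarity is the smallest Euclidean distance.
    similarities = projected @ codebook.T
    return np.argmax(similarities, axis=1)


# Hypothetical input: 100 frames of 80-dimensional features (e.g. log-mel filterbanks).
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, codebook_size=4096)
frames = np.random.default_rng(1).normal(size=(100, 80))
labels = quantize(frames, projection, codebook)
```

Because the projection and codebook stay fixed, the labels are deterministic for a given input, so the same targets can be reused throughout pre-training without learning a separate quantizer.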

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling