Massively Multilingual Shallow Fusion with Large Language Models

Ke Hu; Tara N. Sainath; Bo Li; Nan Du; Yanping Huang; Andrew M. Dai; Yu Zhang; Rodrigo Cabrera; Zhifeng Chen; Trevor Strohman

arXiv:2302.08917·cs.CL·February 20, 2023

TL;DR

This paper applies a massively multilingual mixture-of-experts language model (GLaM) to shallow fusion in speech recognition, covering up to 84 languages and reducing WER in most evaluated languages while keeping inference computation roughly constant.

Contribution

It trains a single multilingual language model, scaled up with the mixture-of-experts GLaM architecture, for shallow fusion across up to 84 languages. Because only two experts are selected at each decoding step, inference computation stays roughly constant as the number of experts grows.
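As a rough illustration of what shallow fusion means here, the sketch below combines an ASR score with an external LM score by log-linear interpolation and picks the best hypothesis. It is a simplified n-best rescoring version, whereas fusion in the paper happens during decoding; the function names, the weight `lm_weight`, and all scores are hypothetical and are not taken from the paper.

```python
def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Log-linear interpolation of the ASR score and the external LM score.

    lm_weight is an illustrative value, not one reported in the paper.
    """
    return asr_log_prob + lm_weight * lm_log_prob


def rescore(hypotheses, lm_weight=0.3):
    """Return the (text, asr_log_prob, lm_log_prob) tuple with the best fused score."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight),
    )


# Made-up hypotheses: the second is slightly preferred by the ASR model alone,
# but the LM finds it much less plausible.
hyps = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -12.0),
]
best = rescore(hyps)
```

With `lm_weight=0.0` the ASR score alone decides and the second hypothesis wins; with the LM contribution included, the first one does.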

Findings

01

GLaM reduces WER by 4.4% relative on an English long-tail test set, compared to a dense LM of similar inference computation.

02

GLaM improves 41 out of 50 languages with an average WER reduction of 3.85%.

03

GLaM achieves an average WER reduction of 5.53% over 43 languages.
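The abstract reports the English figure as a relative reduction. A minimal sketch of how a relative WER reduction is computed is shown below; the absolute WER values are hypothetical, since this page does not give them.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs (in percent), chosen only to illustrate the formula.
example = relative_wer_reduction(10.0, 9.56)
```

Under these made-up numbers, a drop from 10.0% to 9.56% absolute WER corresponds to a relative reduction of about 4.4%.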

Abstract

While large language models (LLMs) have made impressive progress in natural language processing, it remains unclear how to utilize them in improving automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up using a mixture-of-experts LLM, i.e., generalist language model (GLaM). When the number of experts increases, GLaM dynamically selects only two at each decoding step to keep the inference computation roughly constant. We then apply GLaM to a multilingual shallow fusion task based on a state-of-the-art end-to-end model. Compared to a dense LM of similar computation during inference, GLaM reduces the WER of an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41…
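The "selects only two experts" idea can be sketched as top-2 gating: a gate scores every expert, but only the two highest-scoring experts are evaluated, so compute per step does not grow with the expert count. This is a toy illustration under stated assumptions, not GLaM's actual implementation: the experts are scalar functions, the gate logits are given directly instead of coming from a learned gating network, and renormalizing the two selected gate weights is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(x, gate_logits, experts):
    """Evaluate only the two highest-scoring experts and mix their outputs.

    Returns the mixed output and the indices of the selected experts.
    """
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize over the selected pair (assumption)
    output = sum(probs[i] / norm * experts[i](x) for i in top2)
    return output, top2


# Four hypothetical experts: expert k multiplies its input by k + 1.
experts = [lambda x, k=k: k * x for k in range(1, 5)]
out, chosen = top2_moe(2.0, [0.1, 2.0, -1.0, 1.0], experts)
```

Here the gate selects experts 1 and 3, and the output is a weighted mix of their two results; the other two experts are never called, regardless of how many experts exist.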

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques