mSLAM: Massively multilingual joint pre-training for speech and text

Ankur Bapna; Colin Cherry; Yu Zhang; Ye Jia; Melvin Johnson; Yong Cheng; Simran Khanuja; Jason Riesa; Alexis Conneau

arXiv:2202.01374·cs.CL·February 4, 2022·59 cites

TL;DR

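mSLAM is a multilingual model pre-trained jointly on large amounts of unlabeled speech and text. Compared with speech-only pre-training, it improves speech translation, speech intent classification, and speech language identification while staying competitive on multilingual ASR, and its speech translation model shows zero-shot text translation.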

Contribution

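This work introduces mSLAM, a joint pre-training framework that learns cross-lingual, cross-modal representations of speech and text in a single model. It combines w2v-BERT pre-training on speech, SpanBERT pre-training on character-level text, and CTC losses on paired speech and transcript data.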

Findings

01

Improves speech translation, speech intent classification, and speech language identification over speech-only pre-training, while remaining competitive on multilingual ASR.

02

Achieves zero-shot text translation without explicit text translation data.

03

Benefits from multi-modal fine-tuning to further enhance speech translation quality.

Abstract

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot…
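The text side of the objective described above, SpanBERT-style masking applied to character-level text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_spans`, the mask symbol, the masking ratio, and the maximum span length are all hypothetical, and span lengths are drawn uniformly here for simplicity, whereas SpanBERT samples them from a geometric distribution.

```python
import random

MASK = "_"  # hypothetical mask symbol, chosen only for this illustration


def mask_spans(chars, mask_ratio=0.3, max_span=5, seed=0):
    """Mask contiguous character spans until about mask_ratio of positions are hidden.

    Returns the corrupted sequence and a dict mapping each masked position
    to the original character the model would be trained to recover.
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(chars)
    budget = int(n * mask_ratio)  # number of positions to mask
    masked = set()
    attempts = 0
    # The attempt cap guards against looping forever on unlucky draws.
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        span_len = rng.randint(1, max_span)  # uniform span length (simplification)
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            if len(masked) >= budget:
                break
            masked.add(i)
    corrupted = [MASK if i in masked else c for i, c in enumerate(chars)]
    targets = {i: chars[i] for i in sorted(masked)}
    return corrupted, targets


if __name__ == "__main__":
    text = list("multilingual speech and text")
    corrupted, targets = mask_spans(text)
    print("".join(corrupted))
    print(targets)
```

Masking whole spans rather than isolated characters matters more at the character level than at the word level, since a single hidden character is often trivially recoverable from its neighbours.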

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis