SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

Ankur Bapna; Yu-an Chung; Nan Wu; Anmol Gulati; Ye Jia; Jonathan H. Clark; Melvin Johnson; Jason Riesa; Alexis Conneau; Yu Zhang

arXiv:2110.10329·cs.CL·October 22, 2021·50 cites

TL;DR

This paper presents a single self-attention encoder pre-trained on both speech and text data. It improves speech translation quality while retaining near state-of-the-art ASR performance, and it examines the capacity and interference challenges of multi-modal pre-training.

Contribution

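The authors introduce a single encoder pre-trained jointly on unlabeled text (BERT objective) and unlabeled speech (w2v-BERT objective), with two alignment losses, Translation Language Modeling (TLM) and Speech Text Matching (STM), that use supervised speech-text data to align representations across the two modalities.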

Findings

1. Improved speech translation BLEU scores by around 1 point.

2. Retained near state-of-the-art performance on LibriSpeech and SpeechStew.

3. Identified capacity limitations and interference issues in multi-modal pre-training.

Abstract

Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve…
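As a rough illustration of the TLM alignment idea named in the abstract, the sketch below builds one training example by concatenating a paired speech sequence and its transcript and masking positions in both segments, so that a model could use context from either modality to predict a masked position. It is not the authors' implementation: the function names, the `[MASK]` symbol, the 0.15 masking rate, and the use of discrete "speech unit" strings in place of continuous speech features are all assumptions made for illustration, and the masking is a simplified variant that always substitutes the mask symbol.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_tokens(tokens, mask_prob, rng):
    """Simplified BERT-style masking: each token is independently
    replaced by MASK with probability mask_prob.

    Returns (masked_tokens, targets); targets[i] holds the original
    token at masked positions and None elsewhere.
    """
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def build_tlm_example(speech_units, text_tokens, mask_prob=0.15, seed=0):
    """Concatenate a paired speech/text example and mask both segments.

    Assumption: speech is represented here by discrete unit strings;
    a real system would operate on continuous speech features.
    Returns (inputs, targets, segment_ids) with segment 0 = speech,
    segment 1 = text.
    """
    rng = random.Random(seed)
    masked_speech, speech_targets = mask_tokens(speech_units, mask_prob, rng)
    masked_text, text_targets = mask_tokens(text_tokens, mask_prob, rng)
    segment_ids = [0] * len(speech_units) + [1] * len(text_tokens)
    return (
        masked_speech + masked_text,
        speech_targets + text_targets,
        segment_ids,
    )


if __name__ == "__main__":
    speech = ["u12", "u7", "u7", "u33"]  # made-up speech unit IDs
    text = ["hello", "world"]
    inputs, targets, segments = build_tlm_example(speech, text, mask_prob=0.5)
    print(inputs)
    print(targets)
    print(segments)
```

The segment IDs are one simple way to tell the encoder which modality each position belongs to; the abstract does not say how the model itself marks the two modalities.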

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Weight Decay · Residual Connection · Linear Warmup With Linear Decay · WordPiece · Attention Dropout