SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
Ankur Bapna, Yu-an Chung, Nan Wu, Anmol Gulati, Ye Jia, Jonathan H. Clark, Melvin Johnson, Jason Riesa, Alexis Conneau, Yu Zhang

TL;DR
This paper presents a unified encoder model pre-trained on both speech and text data using self-attention, improving speech translation and ASR performance while exploring the challenges of multi-modal pre-training.
Contribution
The authors introduce a single model pre-trained jointly on speech and text with alignment losses, advancing the integration of speech and language understanding.
Findings
Improved speech translation BLEU scores by around 1 point.
Retained near state-of-the-art performance on LibriSpeech and SpeechStew.
Identified capacity limitations and interference issues in multi-modal pre-training.
Abstract
Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve…
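As a rough illustration of the Translation Language Modeling idea mentioned in the abstract, the sketch below concatenates a paired speech sequence and text sequence and masks positions across both, so that a model could draw on either modality to recover a masked item. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `[MASK]` symbol, the 15% default mask rate, and the use of discrete "speech units" in place of real acoustic features are all hypothetical choices made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_positions(length, mask_prob, rng):
    """Pick positions to mask; masks at least one position when length > 0."""
    if length == 0:
        return []
    chosen = [i for i in range(length) if rng.random() < mask_prob]
    if not chosen:
        chosen = [rng.randrange(length)]
    return chosen


def build_tlm_example(speech_units, text_tokens, mask_prob=0.15, seed=0):
    """Concatenate a paired speech/text sequence and mask items in both.

    speech_units: hypothetical discrete ids standing in for speech frames.
    text_tokens:  tokens of the paired transcript.
    Returns (masked_sequence, labels, modality_tags); labels hold the
    original item at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    sequence = list(speech_units) + list(text_tokens)
    modality = ["speech"] * len(speech_units) + ["text"] * len(text_tokens)
    masked = list(sequence)
    labels = [None] * len(sequence)
    for i in mask_positions(len(sequence), mask_prob, rng):
        labels[i] = sequence[i]
        masked[i] = MASK
    return masked, labels, modality


if __name__ == "__main__":
    masked, labels, modality = build_tlm_example(
        ["s17", "s4", "s92"], ["the", "cat", "sat"], mask_prob=0.3, seed=1
    )
    for item, label, tag in zip(masked, labels, modality):
        print(tag, item, label)
```

A prediction loss would then be computed only at positions where the label is not `None`; how the paper weights this loss against the BERT, w2v-BERT, and STM objectives is not stated in the abstract and is left out here.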
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Weight Decay · Residual Connection · Linear Warmup With Linear Decay · WordPiece · Attention Dropout
