A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability
Jian Xue, Peidong Wang, Jinyu Li, Eric Sun

TL;DR
This paper presents SM2, a streaming multilingual speech model trained with weak supervision that achieves high-quality translation and transcription, including true zero-shot capabilities for unseen language pairs.
Contribution
Introduces a weakly-supervised, streaming multilingual speech model with zero-shot translation ability, trained on large-scale data without human-labeled speech translation datasets.
Findings
SM2 achieves comparable or even better speech translation (ST) quality than some recent large-scale non-streaming speech models.
SM2 demonstrates true zero-shot translation for unseen language pairs.
The model is trained on 351k hours of anonymized speech data from 25 languages.
Abstract
In this paper, we introduce our work on building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into text in the target language. The backbone of SM2 is Transformer Transducer, which has high streaming capability. Instead of human-labeled speech translation (ST) data, SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service. With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models. More importantly, we show that SM2 has truly zero-shot capability when expanding to new target languages, yielding high-quality ST results for {source-speech, target-text} pairs that are not seen during…
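The weak-supervision step described in the abstract (pairing ASR audio with machine-translated transcripts) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `AsrExample`, `StExample`, `build_weak_st_pairs`, and `toy_translate` are hypothetical, the translation function is a stand-in for whatever machine translation service is used, and keeping the transcript unchanged when the source and target languages match is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AsrExample:
    """One utterance from a speech recognition corpus (hypothetical schema)."""
    audio_path: str
    transcript: str
    source_lang: str


@dataclass
class StExample:
    """One weakly supervised {source-speech, target-text} training pair."""
    audio_path: str
    target_text: str
    source_lang: str
    target_lang: str


# Signature assumed for the translation backend: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def build_weak_st_pairs(
    asr_corpus: Iterable[AsrExample],
    translate: Translator,
    target_lang: str,
) -> List[StExample]:
    """Turn ASR examples into ST pairs by translating each transcript."""
    pairs = []
    for ex in asr_corpus:
        if ex.source_lang == target_lang:
            # Assumption: same-language data is kept as a transcription example.
            text = ex.transcript
        else:
            text = translate(ex.transcript, ex.source_lang, target_lang)
        pairs.append(StExample(ex.audio_path, text, ex.source_lang, target_lang))
    return pairs


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Toy stand-in for a machine translation service (lookup table only)."""
    table = {("de", "en"): {"hallo welt": "hello world"}}
    return table.get((source_lang, target_lang), {}).get(text, text)


if __name__ == "__main__":
    corpus = [
        AsrExample("a.wav", "hallo welt", "de"),
        AsrExample("b.wav", "good morning", "en"),
    ]
    for pair in build_weak_st_pairs(corpus, toy_translate, "en"):
        print(pair)
```

No human-labeled ST data enters this loop: the only supervision is the original ASR transcript plus the machine translation output, which is why the resulting labels are considered weak.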
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings
