MAESTRO: Matched Speech Text Representations through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

TL;DR
Maestro is a self-supervised training method that unifies speech and text representations. It aligns the two modalities directly instead of converting one into the other (for example, through speech synthesis), and improves results on several speech recognition and speech translation tasks.
Contribution
The paper presents a novel algorithm that learns unified speech-text representations through sequence alignment and embedding matching, and that outperforms previous methods on ASR and speech translation benchmarks.
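To make the idea of sequence alignment followed by embedding matching concrete, here is a minimal sketch. It is not the authors' implementation: the function names `upsample_text` and `matching_loss` are hypothetical, the per-token frame durations are assumed to be given by some alignment, and mean squared error is an assumed choice of matching objective. The sketch repeats each text-token embedding for the number of speech frames it is aligned to, then measures how far the speech frames are from the upsampled text embeddings.

```python
from typing import List, Sequence

Vector = List[float]


def upsample_text(text_emb: Sequence[Vector], durations: Sequence[int]) -> List[Vector]:
    """Repeat each token embedding for the number of frames it is aligned to."""
    if len(text_emb) != len(durations):
        raise ValueError("need exactly one duration per text token")
    frames: List[Vector] = []
    for vec, n_frames in zip(text_emb, durations):
        frames.extend(list(vec) for _ in range(n_frames))
    return frames


def matching_loss(
    speech_emb: Sequence[Vector],
    text_emb: Sequence[Vector],
    durations: Sequence[int],
) -> float:
    """Mean squared error between speech frames and upsampled text embeddings.

    MSE is an assumption made for this sketch, not a detail taken from the summary.
    """
    upsampled = upsample_text(text_emb, durations)
    if len(upsampled) != len(speech_emb):
        raise ValueError("durations must sum to the number of speech frames")
    total, count = 0.0, 0
    for speech_frame, text_frame in zip(speech_emb, upsampled):
        for s, t in zip(speech_frame, text_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy example: 3 text tokens aligned to 6 speech frames, embedding size 2.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
durations = [2, 3, 1]
# Pretend speech frames sit at a constant offset of 0.1 from the aligned text.
speech_emb = [[v + 0.1 for v in frame] for frame in upsample_text(text_emb, durations)]
loss = matching_loss(speech_emb, text_emb, durations)
```

In a real system the embeddings would come from trained speech and text encoders and the loss would be minimised by gradient descent; the toy lists above only show how the alignment turns a token-level sequence into a frame-level one so the two modalities can be compared position by position.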
Findings
State-of-the-art results on VoxPopuli multilingual ASR with 8% relative WER reduction
Improved performance on SpeechStew ASR with 3.7% relative WER reduction
Enhanced multilingual speech translation with 2.8 BLEU average gain across 21 languages
Abstract
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multitasking and parameter sharing or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both these modalities simultaneously that can…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
