MAESTRO: Matched Speech Text Representations through Modality Matching

Zhehuai Chen; Yu Zhang; Andrew Rosenberg; Bhuvana Ramabhadran; Pedro Moreno; Ankur Bapna; Heiga Zen

arXiv:2204.03409 · cs.CL · July 5, 2022

TL;DR

Maestro is a self-supervised training method that unifies speech and text representations by aligning the two modalities directly, without converting between them via speech synthesis. It improves results on multilingual ASR, multidomain ASR, and multilingual speech translation.

Contribution

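The paper presents an algorithm that learns unified speech-text representations through sequence alignment and embedding matching. It is designed to avoid both the cross-modal interference seen with multitask parameter sharing and the added complexity of synthesis-based modality conversion.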
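One way to picture "sequence alignment and embedding matching" is the minimal sketch below. It is not the authors' implementation: the duration-based upsampling, the mean-squared-error objective, and the names `upsample_by_duration` and `matching_loss` are all assumptions made for illustration, since this summary does not specify how the alignment or the matching is done.

```python
from typing import List, Sequence

Vector = List[float]


def upsample_by_duration(text_emb: Sequence[Vector],
                         durations: Sequence[int]) -> List[Vector]:
    """Repeat each token embedding once per speech frame it is aligned to.

    Assumption: an alignment step has already produced one integer
    duration (in frames) per text token.
    """
    if len(text_emb) != len(durations):
        raise ValueError("one duration per token is required")
    frames: List[Vector] = []
    for vec, dur in zip(text_emb, durations):
        frames.extend(list(vec) for _ in range(dur))
    return frames


def matching_loss(speech_emb: Sequence[Vector],
                  text_frames: Sequence[Vector]) -> float:
    """Mean squared error between frame-aligned speech and text embeddings.

    Assumption: MSE stands in for whatever matching objective is used.
    """
    if not speech_emb or len(speech_emb) != len(text_frames):
        raise ValueError("sequences must be non-empty and of equal length")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(speech_emb, text_frames):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy data (hypothetical): two text tokens aligned to three speech frames.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
durations = [2, 1]
speech_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.5]]

text_frames = upsample_by_duration(text_emb, durations)
loss = matching_loss(speech_emb, text_frames)
print(loss)
```

In this toy setup the loss is zero only when every speech frame embedding equals the embedding of the token it is aligned to, which is the sense in which the two modalities would share one representation space.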

Findings

01. State-of-the-art results on VoxPopuli multilingual ASR, with an 8% relative WER reduction

02. Improved performance on SpeechStew multidomain ASR, with a 3.7% relative WER reduction

03. Improved multilingual speech translation on CoVoST 2, with an average gain of 2.8 BLEU across 21 languages
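Because the WER findings are reported as relative reductions, a short worked example may help separate relative from absolute change. The function name `relative_wer_reduction` and the baseline and improved WER values below are hypothetical, chosen only to illustrate the arithmetic; they are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper:
# a drop from 10.0% WER to 9.2% WER is 0.8 points absolute.
example = relative_wer_reduction(10.0, 9.2)
print(f"{example:.1%}")  # prints 8.0%
```

Under these made-up numbers, an 8% relative reduction corresponds to less than one absolute WER point, so the size of the absolute gain depends on the baseline.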

Abstract

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multitasking and parameter sharing or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both these modalities simultaneously that can…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling