MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Xiaohuan Zhou; Jiaming Wang; Zeyu Cui; Shiliang Zhang; Zhijie Yan; Jingren Zhou; Chang Zhou

arXiv:2212.00500·cs.MM·December 2, 2022·1 cites

TL;DR

MMSpeech introduces a multi-modal, multi-task encoder-decoder pre-training framework for Mandarin speech recognition, integrating speech, text, and phoneme data to improve recognition accuracy significantly.

Contribution

The paper presents a multi-task pre-training approach that adds a phoneme modality and combines self-supervised and supervised tasks, improving Mandarin ASR performance over existing pre-training methods.

Findings

01 Achieves state-of-the-art results on AISHELL-1

02 Over 40% relative improvement compared to previous pre-training methods

03 Effectively leverages unlabeled speech and text data for pre-training
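
As a reading aid for the "relative improvement" finding, the sketch below shows how such a figure is usually computed, assuming the improvement is measured as a reduction in an error rate. The function name and the two error rates are hypothetical and chosen only to make the arithmetic visible; they are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the error reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (NOT from the paper), used only for illustration.
hypothetical_baseline = 5.0
hypothetical_new = 3.0

improvement = relative_improvement(hypothetical_baseline, hypothetical_new)
print(f"{improvement:.0%}")  # (5.0 - 3.0) / 5.0 -> 40%
```

A relative figure depends on the baseline: the same absolute drop of two points would be a smaller relative improvement against a weaker baseline with a higher error rate.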

Abstract

In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce…
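
The loose mapping between Mandarin characters and sounds that motivates the phoneme modality can be seen in a toy example. The sketch below uses a tiny hand-written lexicon with tone-numbered pinyin as a stand-in for phonemes; the lexicon, the helper `invert`, and the use of pinyin are assumptions made for illustration and are not the paper's phoneme inventory or code.

```python
from collections import defaultdict

# Tiny hand-written lexicon (illustrative assumption, not the paper's data).
# Tone-numbered pinyin stands in for a phoneme sequence.
CHAR_TO_PINYIN = {
    "他": ["ta1"],            # "he"
    "她": ["ta1"],            # "she"
    "它": ["ta1"],            # "it"
    "行": ["xing2", "hang2"],  # one character, two common readings
}


def invert(lexicon):
    """Map each reading to the list of characters that can carry it."""
    sound_to_chars = defaultdict(list)
    for char, readings in lexicon.items():
        for reading in readings:
            sound_to_chars[reading].append(char)
    return dict(sound_to_chars)


SOUND_TO_CHARS = invert(CHAR_TO_PINYIN)

# Several characters share one sound (homophones).
homophones = SOUND_TO_CHARS["ta1"]
# One character has several sounds (a polyphone).
polyphone_readings = CHAR_TO_PINYIN["行"]

print(homophones)
print(polyphone_readings)
```

Because the mapping is many-to-many in both directions, the written character alone does not determine the sound and the sound alone does not determine the character. An intermediate phoneme-like representation sits closer to the acoustic signal than characters do.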

Code & Models

Repositories

ofa-sys/ofa (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques