MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition
Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou

TL;DR
MMSpeech introduces a multi-modal, multi-task encoder-decoder pre-training framework for Mandarin speech recognition, integrating speech, text, and phoneme data to improve recognition accuracy significantly.
Contribution
The paper presents a novel multi-task pre-training approach incorporating phoneme modality and self-supervised tasks, enhancing Mandarin ASR performance over existing methods.
Findings
Achieves state-of-the-art results on AISHELL-1
Over 40% relative improvement compared to previous pre-training methods
Effectively leverages unlabeled speech and text data for pre-training
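For readers unfamiliar with the term, a "relative improvement" on an error-rate metric is usually computed as the error reduction divided by the baseline error. The sketch below illustrates that arithmetic only. The function name and the error rates are hypothetical and chosen for illustration; they are not figures reported in the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical character error rates (in percent), NOT results from the paper.
    baseline_cer = 5.0
    new_cer = 2.9
    gain = relative_improvement(baseline_cer, new_cer)
    print(f"Relative improvement: {gain:.0%}")
```

Under this definition, a relative improvement above 40% means the new error rate is below 60% of the baseline error rate.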
Abstract
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
