Unsupervised pre-training for sequence to sequence speech recognition
Zhiyun Fan, Shiyu Zhou, Bo Xu

TL;DR
This paper introduces a two-stage unsupervised pre-training approach for seq2seq speech recognition models, leveraging unpaired speech and transcripts to improve performance across multiple datasets and languages.
Contribution
It presents a novel two-stage pre-training method combining acoustic and linguistic knowledge, enhancing seq2seq ASR performance without requiring paired data during pre-training.
Findings
Significant reduction in character error rate on AISHELL-1 and HKUST datasets.
Consistent outperformance of baseline models across six languages in CALLHOME.
Effective cross-lingual transfer demonstrated by improved results on multiple datasets.
Abstract
This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts, respectively. Our pre-training method is divided into two stages, named acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training method integrates rich acoustic and linguistic knowledge into the seq2seq model, which benefits downstream automatic speech recognition (ASR) tasks. The unsupervised pre-training is performed on the AISHELL-2 dataset and we apply the pre-trained…
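The acoustic pre-training objective described in the abstract (predicting masked speech feature chunks from their context) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mask_chunks` and `masked_l1_loss`, the chunk length, the masking probability, zero-filling of masked frames, and the L1 reconstruction loss are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np


def mask_chunks(features, chunk_len=4, mask_prob=0.15, rng=None):
    """Zero out randomly selected chunks of consecutive frames.

    features: array of shape (num_frames, feature_dim).
    chunk_len and mask_prob are illustrative values, not taken from the paper.
    Returns the masked copy and a boolean mask over frames (True = masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = features.shape[0]
    num_chunks = int(np.ceil(num_frames / chunk_len))
    # Decide per chunk whether it is masked, then expand to per-frame flags.
    chunk_is_masked = rng.random(num_chunks) < mask_prob
    frame_mask = np.repeat(chunk_is_masked, chunk_len)[:num_frames]
    masked = features.copy()
    masked[frame_mask] = 0.0  # assumption: masked frames are replaced by zeros
    return masked, frame_mask


def masked_l1_loss(predicted, target, frame_mask):
    """Mean absolute error computed over masked frames only (assumed loss)."""
    if not frame_mask.any():
        return 0.0
    return float(np.abs(predicted[frame_mask] - target[frame_mask]).mean())
```

In an actual training loop, the masked features would be passed through the encoder and its output compared against the original features with a loss of this kind, so that the encoder must use the surrounding frames to fill in the hidden chunks.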
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
