Unsupervised pre-training for sequence to sequence speech recognition

Zhiyun Fan; Shiyu Zhou; Bo Xu

arXiv:1910.12418·cs.SD·January 3, 2020·20 cites

TL;DR

This paper introduces a two-stage unsupervised pre-training approach for seq2seq speech recognition models, leveraging unpaired speech and transcripts to improve performance across multiple datasets and languages.

Contribution

It presents a novel two-stage pre-training method that combines acoustic and linguistic knowledge, improving seq2seq ASR performance without requiring paired data for pre-training.

Findings

1. Significant reduction in character error rate on AISHELL-1 and HKUST datasets.

2. Consistent outperformance of baseline models across six languages in CALLHOME.

3. Effective cross-lingual transfer demonstrated by improved results on multiple datasets.

Abstract

This paper proposes a novel approach to pre-train an encoder-decoder sequence-to-sequence (seq2seq) model with unpaired speech and transcripts respectively. Our pre-training method is divided into two stages, named acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large number of transcripts using a single-speaker text-to-speech (TTS) system, and use the synthesized paired data to pre-train the decoder. This two-stage pre-training method integrates rich acoustic and linguistic knowledge into the seq2seq model, which will benefit downstream automatic speech recognition (ASR) tasks. The unsupervised pre-training is performed on the AISHELL-2 dataset and we apply the pre-trained…
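
The acoustic pre-training objective named in the abstract (predicting masked speech feature chunks from the surrounding context) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_chunks` and `masked_l1_loss`, the choice to zero out masked frames, the chunk length and count, and the L1 reconstruction loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from typing import List, Tuple

Frames = List[List[float]]  # T frames, each a D-dimensional feature vector


def mask_chunks(
    frames: Frames, chunk_len: int, num_chunks: int, rng: random.Random
) -> Tuple[Frames, List[bool]]:
    """Zero out `num_chunks` contiguous chunks of `chunk_len` frames.

    Returns a masked copy of the input and a per-frame boolean mask.
    Zeroing is an assumed masking strategy; chunks may overlap.
    """
    total = len(frames)
    if chunk_len <= 0 or chunk_len > total:
        raise ValueError("chunk_len must be in [1, number of frames]")
    masked = [list(f) for f in frames]  # copy so the input stays intact
    is_masked = [False] * total
    for _ in range(num_chunks):
        start = rng.randrange(0, total - chunk_len + 1)
        for t in range(start, start + chunk_len):
            masked[t] = [0.0] * len(frames[t])
            is_masked[t] = True
    return masked, is_masked


def masked_l1_loss(pred: Frames, target: Frames, is_masked: List[bool]) -> float:
    """Mean absolute error computed over masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for p, y, m in zip(pred, target, is_masked):
        if m:
            total += sum(abs(a - b) for a, b in zip(p, y))
            count += len(y)
    return total / count if count else 0.0


# Toy input: 20 frames of 2-dimensional features.
frames = [[float(t), float(t) + 0.5] for t in range(20)]
masked, is_masked = mask_chunks(frames, chunk_len=4, num_chunks=2, rng=random.Random(0))

# In real pre-training an encoder would map `masked` to a prediction;
# here the masked input itself stands in as a trivial "prediction".
loss = masked_l1_loss(masked, frames, is_masked)
```

In an actual setup the encoder would receive `masked` and be trained to minimize the loss against `frames` at the masked positions, so that it has to infer the hidden chunks from their context.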

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence