Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing

Heli Qi; Sashi Novitasari; Sakriani Sakti; Satoshi Nakamura

arXiv:2205.06963·cs.CL·May 17, 2022

TL;DR

This paper introduces an enhanced consistency training method for semi-supervised sequence-to-sequence speech recognition, leveraging speech chain reconstruction and dynamic pseudo labels to improve accuracy over the existing paradigm that relies on SpecAugment and a static teacher model.

Contribution

It proposes a novel semi-supervised training paradigm that uses speech chain reconstruction and dynamic pseudo transcripts, surpassing previous static teacher-based approaches.

Findings

01. Achieves 12.2% CER reduction on LJSpeech

02. Achieves 38.6% CER reduction on LibriSpeech

03. Outperforms baseline semi-supervised methods

Abstract

Consistency regularization has recently been applied to semi-supervised sequence-to-sequence (S2S) automatic speech recognition (ASR). This principle encourages an ASR model to output similar predictions for the same input speech with different perturbations. The existing paradigm of semi-supervised S2S ASR utilizes SpecAugment as data augmentation and requires a static teacher model to produce pseudo transcripts for untranscribed speech. However, this paradigm fails to take full advantage of consistency regularization. First, the masking operations of SpecAugment may damage the linguistic contents of the speech, thus influencing the quality of pseudo labels. Second, S2S ASR requires both input speech and prefix tokens to make the next prediction. The static prefix tokens made by the offline teacher model cannot match dynamic pseudo labels during consistency training. In this work, we…
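
The consistency principle described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_model`, `time_mask`, and `consistency_loss` are hypothetical names, the "model" is a mean-pool plus linear layer standing in for a real S2S ASR network, and KL divergence is assumed here as the measure of disagreement between the two predictions. The sketch only shows the general idea: perturb the same input in two different ways and penalize the difference between the resulting output distributions.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def time_mask(frames, start, width):
    """Zero out `width` consecutive frames from `start` (SpecAugment-style time masking)."""
    return [
        [0.0] * len(f) if start <= t < start + width else list(f)
        for t, f in enumerate(frames)
    ]

def toy_model(frames, weights):
    """Hypothetical stand-in for an ASR model: mean-pool frames, then a linear layer."""
    dim = len(frames[0])
    pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]

def consistency_loss(frames, weights, mask_a, mask_b):
    """Disagreement between predictions on two perturbed views of the same input."""
    p = softmax(toy_model(time_mask(frames, *mask_a), weights))
    q = softmax(toy_model(time_mask(frames, *mask_b), weights))
    return kl_divergence(p, q)

# Toy data: 10 frames of 4 features, 3 output tokens (all values are illustrative).
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

same = consistency_loss(frames, weights, (2, 3), (2, 3))  # identical perturbations
diff = consistency_loss(frames, weights, (0, 2), (6, 3))  # different perturbations
print(same, diff)
```

With identical masks the two views coincide and the loss is zero; with different masks the loss measures how much the prediction changes under perturbation, which is the quantity consistency training tries to minimize. The sketch also hints at the first concern raised in the abstract: a mask that removes linguistically important frames changes the prediction for reasons unrelated to model robustness.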

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing