Word Order Does Not Matter For Speech Recognition

Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert

arXiv:2110.05994 · eess.AS · October 20, 2021

TL;DR

This paper demonstrates that speech recognition models can be trained effectively without knowing the exact word order in transcripts, using a weakly supervised approach that achieves near-supervised performance.

Contribution

The authors introduce a novel weakly supervised training method for speech recognition that does not require word order information in transcripts.

Findings

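01 Achieves 2.3%/4.6% WER on the LibriSpeech test-clean/test-other subsets

02 Closely matches the supervised baseline's performance

03 Uses a two-stage training process: a word-level model generates pseudo-labels, which are then used to train a letter-based CTC model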

Abstract

In this paper, we study training an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using a Connectionist Temporal Classification (CTC) loss. Our system achieves 2.3%/4.6% word error rate on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
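The first stage described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the pooled scores are normalized or how the target distribution is built, so the sketch assumes a softmax over the vocabulary after LogSumExp pooling and a target made of normalized word counts. The function names (`bag_of_words_target`, `word_order_free_loss`) and the toy vocabulary are hypothetical.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def bag_of_words_target(words, vocab):
    """Normalized word counts of a transcript; word order is discarded.

    Assumption: the abstract only says "ground-truth words distribution",
    so normalized counts are an illustrative choice.
    """
    counts = np.zeros(len(vocab))
    for w in words:
        counts[vocab.index(w)] += 1.0
    return counts / counts.sum()


def word_order_free_loss(frame_scores, target):
    """Cross-entropy between pooled frame scores and a bag-of-words target.

    frame_scores: (T, V) array of per-frame word scores from an acoustic model.
    target:       (V,) array summing to 1.
    """
    pooled = logsumexp(frame_scores, axis=0)          # aggregate over time, shape (V,)
    log_probs = pooled - logsumexp(pooled, axis=-1)   # assumed softmax over the vocabulary
    return -np.sum(target * log_probs)


# Toy example: random scores stand in for acoustic model outputs.
vocab = ["the", "cat", "sat", "dog"]
rng = np.random.default_rng(0)
frame_scores = rng.normal(size=(50, len(vocab)))

target_a = bag_of_words_target(["the", "cat", "sat"], vocab)
target_b = bag_of_words_target(["sat", "the", "cat"], vocab)

loss_a = word_order_free_loss(frame_scores, target_a)
loss_b = word_order_free_loss(frame_scores, target_b)
```

Because the target is a bag of words and the pooling sums over all frames, `loss_a` and `loss_b` coincide: shuffling the transcript, or the frames, leaves the loss unchanged, which is why this kind of objective needs no word order information.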

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing