Word Order Does Not Matter For Speech Recognition
Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert

TL;DR
This paper demonstrates that speech recognition models can be trained effectively without knowing the exact word order in transcripts, using a weakly supervised approach that achieves near-supervised performance.
Contribution
The authors introduce a novel weakly supervised training method for speech recognition that does not require word order information in transcripts.
Findings
Achieves 2.3%/4.6% WER on the LibriSpeech test-clean/test-other subsets
Matches supervised baseline performance closely
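Uses a two-stage training process: a word-level acoustic model generates pseudo-labels, which are then used to train a letter-based acoustic model with CTC loss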
Abstract
In this paper, we study the training of an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using the LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification (CTC) loss. Our system achieves 2.3%/4.6% WER on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
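The first-stage loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`logsumexp`, `bag_of_words_loss`), the softmax normalization over the vocabulary, and the use of normalized word counts as the target distribution are all assumptions made for illustration. The point it shows is that the loss depends only on which words occur in the transcript and how often, not on their order.

```python
import math
from collections import Counter


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_of_words_loss(frame_logits, transcript_words, vocab):
    """Cross-entropy between a time-aggregated word distribution and
    the normalized word counts of an unordered transcript.

    frame_logits: list of T frames, each a list of V word scores
                  (hypothetical acoustic model output).
    transcript_words: words of the transcript, in any order; assumed
                  to be non-empty and to contain only words in vocab.
    vocab: list of V words, aligned with the score columns.
    """
    num_words = len(vocab)

    # Aggregate each word's scores over all frames with LogSumExp.
    aggregated = [
        logsumexp([frame[v] for frame in frame_logits])
        for v in range(num_words)
    ]

    # Assumption: normalize the aggregated scores with a log-softmax
    # over the vocabulary to obtain a predicted word distribution.
    log_z = logsumexp(aggregated)
    log_probs = [a - log_z for a in aggregated]

    # Target: relative frequency of each word; order is discarded here.
    counts = Counter(transcript_words)
    total = sum(counts.values())

    loss = 0.0
    for v, word in enumerate(vocab):
        target = counts[word] / total
        if target > 0:
            loss -= target * log_probs[v]
    return loss


# Toy example: 3 frames, 3-word vocabulary.
vocab = ["the", "cat", "sat"]
frames = [
    [2.0, 0.1, -1.0],
    [0.0, 3.0, 0.5],
    [-0.5, 0.2, 2.5],
]
loss_ordered = bag_of_words_loss(frames, ["the", "cat", "sat"], vocab)
loss_shuffled = bag_of_words_loss(frames, ["sat", "the", "cat"], vocab)
```

In this sketch, `loss_ordered` and `loss_shuffled` are identical because the target is built from word counts alone, which is the sense in which word order is not needed for the first training stage.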
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
