Effective Decoder Masking for Transformer Based End-to-End Speech Recognition
Shi-Yan Weng, Berlin Chen

TL;DR
This paper introduces a decoder masking training method for transformer-based end-to-end speech recognition, which improves robustness and performance by randomly masking parts of the decoder input during training.
Contribution
It proposes a novel decoder masking approach inspired by BERT and SpecAugment, enhancing generalization and accuracy of transformer-based ASR models.
Findings
Outperforms existing strong E2E ASR systems on the LibriSpeech and TED-LIUM datasets.
Improves robustness of the decoder to corrupted or incomplete decoding history.
Demonstrates significant performance gains with the proposed masking strategy.
Abstract
The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks, such as automatic speech recognition (ASR) and text-to-speech (TTS), among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in this paper a decoder-masking-based training approach for end-to-end (E2E) ASR models. During the training phase, we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to…
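The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_decoder_input`, the masking rate of 0.15, the choice to mask each token independently, and the `<sos>`/`<eos>` symbols are all assumptions made for illustration. The abstract says only that "some portions" of the decoder's history are replaced with [mask].

```python
import random

MASK = "[mask]"  # mask symbol named in the abstract


def mask_decoder_input(tokens, mask_prob=0.15, rng=None, protected=("<sos>",)):
    """Return a copy of the decoder history with some tokens replaced by MASK.

    Assumptions (not from the paper): each token is masked independently
    with probability `mask_prob`, and tokens in `protected` are never masked.
    """
    if rng is None:
        rng = random.Random()
    return [
        MASK if tok not in protected and rng.random() < mask_prob else tok
        for tok in tokens
    ]


# Teacher-forcing setup: the decoder reads the (masked) history and is
# trained to predict the unmasked next tokens.
history = ["<sos>", "the", "cat", "sat", "on", "the", "mat"]
targets = history[1:] + ["<eos>"]
masked = mask_decoder_input(history, mask_prob=0.3, rng=random.Random(0))
```

Only the decoder input is corrupted here; the prediction targets stay intact, so the decoder cannot rely solely on its text history and has to draw on the encoder's acoustic evidence. Masking would be applied during training only, with the unmasked history used at inference.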
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
