An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition
Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan

TL;DR
This paper enhances a non-autoregressive speech recognition transformer by applying convolutional self-attention, expanding trigger masks, and using iterated loss functions, resulting in improved accuracy without external language models.
Contribution
The paper introduces three methods to improve the accuracy of CASS-NAT (CTC alignment-based single step non-autoregressive transformer): convolution-augmented self-attention in the encoder and decoder, expanded trigger masks, and iterated loss functions.
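As a rough illustration of the iterated-loss idea, the sketch below adds losses computed at intermediate layers to the loss of the final layer, so that lower layers receive a more direct training signal. The function name `iterated_loss`, the `weight` parameter, and the uniform weighting of intermediate losses are assumptions made for illustration, not details taken from the paper.

```python
def iterated_loss(final_loss, intermediate_losses, weight=0.5):
    """Combine the final-layer loss with losses taken at intermediate layers.

    `weight` is a hypothetical scaling factor applied to every
    intermediate loss; the paper's actual weighting is not given here.
    """
    return final_loss + weight * sum(intermediate_losses)


# Hypothetical loss values, chosen only to show the arithmetic.
total = iterated_loss(1.0, [2.0, 4.0], weight=0.5)
print(total)
```

In a real training loop each entry would be a differentiable loss tensor rather than a float, but the combination rule would have the same shape.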
Findings
Achieves 3.1%/7.2% WER on the Librispeech test clean/other sets without an external language model.
Improves WER/CER by a relative 7%-21% over the previous CASS-NAT.
Visualizations show acoustic embeddings behave like word embeddings.
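For readers comparing error rates, a percentage improvement of this kind is usually computed as a relative reduction against the baseline rather than a difference in absolute points. The helper below is a minimal sketch of that arithmetic; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction in percent: 100 * (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical error rates (in percent), used only to show the formula.
reduction = relative_reduction(10.0, 8.0)
print(reduction)
```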
Abstract
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods,…
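The trigger-mask expansion mentioned in the abstract can be pictured with a small sketch. It assumes each token already has an acoustic boundary, given as a half-open frame range derived from a CTC alignment, and widens that range by a fixed number of frames on each side. The function name, the `expand` parameter, and the frame-based widening rule are illustrative assumptions; the paper's exact expansion scheme may differ.

```python
def expand_trigger_mask(boundaries, num_frames, expand=1):
    """Build a token-by-frame boolean mask from per-token frame ranges.

    boundaries: list of (start, end) frame ranges, start inclusive and
                end exclusive, one per token (assumed to come from a
                CTC alignment).
    expand:     hypothetical number of extra frames allowed on each side.
    """
    mask = []
    for start, end in boundaries:
        lo = max(0, start - expand)          # clamp at the first frame
        hi = min(num_frames, end + expand)   # clamp at the last frame
        mask.append([lo <= t < hi for t in range(num_frames)])
    return mask


# Three tokens over eight encoder frames (made-up alignment).
boundaries = [(0, 3), (3, 5), (5, 8)]
mask = expand_trigger_mask(boundaries, num_frames=8, expand=1)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With `expand=0` the mask reduces to the original boundaries. A larger value lets each token attend to a few neighbouring frames, which is the sense in which expansion can make the model less sensitive to small alignment errors.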
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Convolution
