An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Ruchao Fan; Wei Chu; Peng Chang; Jing Xiao; Abeer Alwan

arXiv:2106.09885·eess.AS·July 23, 2021

Open Access

TL;DR

This paper improves the accuracy of a single step non-autoregressive speech recognition transformer (CASS-NAT) with convolution-augmented self-attention, expanded trigger masks, and iterated loss functions, all without an external language model.

Contribution

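The paper proposes three methods to improve CASS-NAT's accuracy: convolution-augmented self-attention blocks in both the encoder and decoder, expanded trigger masks (acoustic boundaries) for each token, and iterated loss functions that strengthen the gradient update of low-layer parameters.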
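As a rough illustration of the iterated loss idea, the sketch below adds auxiliary losses computed at intermediate layers to the final-layer loss, so that lower layers receive a more direct training signal. The function name `iterated_loss`, the single shared `weight`, and the example loss values are assumptions made for illustration; the summary above does not say which layers are tapped or how the terms are weighted.

```python
def iterated_loss(final_loss, intermediate_losses, weight=0.5):
    """Combine the final-layer loss with auxiliary intermediate-layer losses.

    final_loss:          loss computed from the top layer's output
    intermediate_losses: losses computed from selected lower layers
                         (which layers to tap is an assumption here)
    weight:              hypothetical shared weight for the auxiliary terms
    """
    return final_loss + weight * sum(intermediate_losses)


# Hypothetical loss values, for illustration only.
total = iterated_loss(1.0, [2.0, 4.0], weight=0.5)
```

In a real training loop the inputs would be tensors from a deep learning framework rather than floats, and the combined value would be the quantity that is backpropagated.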

Findings

01. Achieves 3.1%/7.2% WER on the Librispeech test-clean/test-other sets without an external language model.

02. Achieves a 7%-21% relative WER/CER improvement over the previous CASS-NAT.

03. Visualizations show that the acoustic embeddings behave like word embeddings.

Abstract

Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods,…
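The trigger mask expansion mentioned in the abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a frame-level CTC alignment with blank id 0, treats each run of identical non-blank labels as one token's acoustic segment, and widens every segment by a hypothetical `expand` number of frames on each side. The helper names `token_segments` and `trigger_mask` are invented for this example.

```python
from itertools import groupby

BLANK = 0  # assumed CTC blank id


def token_segments(alignment, blank=BLANK):
    """Return (token, start, end) for each emitted token; end is exclusive."""
    segments = []
    pos = 0
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != blank:
            segments.append((label, pos, pos + length))
        pos += length
    return segments


def trigger_mask(alignment, expand=0, blank=BLANK):
    """Build a [num_tokens][num_frames] boolean mask.

    Row i is True for the frames token i may attend to: its own segment,
    widened by `expand` frames on both sides and clipped to the utterance.
    """
    num_frames = len(alignment)
    mask = []
    for _, start, end in token_segments(alignment, blank):
        lo = max(0, start - expand)
        hi = min(num_frames, end + expand)
        mask.append([lo <= t < hi for t in range(num_frames)])
    return mask


# Toy alignment: token 3 spans frames 1-2, token 5 spans frame 5.
alignment = [0, 3, 3, 0, 0, 5, 0]
strict = trigger_mask(alignment, expand=0)
relaxed = trigger_mask(alignment, expand=1)
```

With `expand=0` a small shift in the alignment can cut a token off from frames it needs; widening the window trades some sharpness for tolerance to such boundary errors, which is the robustness argument the abstract makes.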

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution