CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Ruchao Fan; Wei Chu; Peng Chang; Jing Xiao

arXiv:2010.14725·eess.AS·February 15, 2021

Open Access

TL;DR

This paper introduces CASS-NAT, a non-autoregressive transformer for speech recognition that uses the CTC alignment to decode all tokens in parallel. It reports WERs of 3.8%/9.1% on Librispeech test clean/other without an external LM and runs 51.2x faster than the autoregressive baseline.

Contribution

The paper presents a novel CTC alignment-based approach for single-step non-autoregressive speech recognition, replacing word embeddings with token-level acoustic embeddings for parallel decoding.

Findings

01

Achieves 3.8% WER on Librispeech test clean

02

Runs 51.2x faster than autoregressive baselines

03

Potential WER of 2.3% with oracle alignment

Abstract

We propose a CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains (a) the number of tokens for the decoder input and (b) the time span of the acoustics for each token. This information is used to extract an acoustic representation for each token in parallel, referred to as the token-level acoustic embedding, which substitutes for the word embedding in the autoregressive transformer (AT) to achieve parallel generation in the decoder. During inference, an error-based alignment sampling method is applied to the CTC output space, reducing the WER while retaining parallelism. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on the Librispeech test clean/other sets without an external LM, and a CER of 5.8% on the Aishell1 Mandarin corpus. Compared to…
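To make points (a) and (b) of the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how a frame-level CTC alignment can yield a token count and per-token time spans. The function names, the blank id of 0, the choice to leave blank frames out of every span, and the use of mean pooling as a stand-in for the token-level acoustic embedding are all assumptions made for illustration; the paper's actual extraction mechanism may differ.

```python
from typing import List, Tuple

BLANK = 0  # assumed id of the CTC blank symbol (hypothetical choice)


def alignment_to_spans(alignment: List[int], blank: int = BLANK) -> List[Tuple[int, int, int]]:
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    `end` is exclusive. Blank frames are not assigned to any token here,
    which is a simplifying assumption of this sketch.
    """
    spans: List[List[int]] = []
    prev = blank
    for t, tok in enumerate(alignment):
        if tok != blank and tok != prev:
            spans.append([tok, t, t + 1])   # a new token starts at frame t
        elif tok != blank:
            spans[-1][2] = t + 1            # repeated frame extends the current token
        prev = tok
    return [(tok, start, end) for tok, start, end in spans]


def mean_pool(frames: List[List[float]], spans: List[Tuple[int, int, int]]) -> List[List[float]]:
    """Average the encoder frames inside each span (a stand-in for the
    token-level acoustic embedding; each span is independent of the others)."""
    pooled = []
    for _, start, end in spans:
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Toy alignment over 10 frames: blank, 3, 3, blank, blank, 5, 5, 5, blank, 3
alignment = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3]
spans = alignment_to_spans(alignment)
num_tokens = len(spans)  # (a) number of decoder inputs

# Fake 2-dimensional encoder outputs, one vector per frame
frames = [[float(t), 1.0] for t in range(len(alignment))]
token_embeddings = mean_pool(frames, spans)  # (b) one vector per token span

print(num_tokens)        # 3
print(spans)             # [(3, 1, 3), (5, 5, 8), (3, 9, 10)]
print(token_embeddings)  # [[1.5, 1.0], [6.0, 1.0], [9.0, 1.0]]
```

Because each span is pooled independently of the others, all token vectors can be computed at once, which is the property that removes the left-to-right dependency of an autoregressive decoder.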

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing