Self-Attention Transducers for End-to-End Speech Recognition

Zhengkun Tian; Jiangyan Yi; Jianhua Tao; Ye Bai; Zhengqi Wen

arXiv:1909.13037·eess.AS·February 25, 2020

TL;DR

This paper introduces a self-attention transducer (SA-T) model for end-to-end speech recognition, replacing RNNs with self-attention blocks for better parallelization and long-term dependency modeling, with improved accuracy and online decoding capabilities.

Contribution

The paper proposes a self-attention transducer architecture, together with a path-aware regularization that helps the model learn alignments and a chunk-flow mechanism for online decoding, and reports improved performance over a baseline RNN-T model.

Findings

01. 21.3% relative reduction in character error rate compared with the baseline RNN-T

02. Effective online decoding with minimal performance loss

03. Demonstrated on the Mandarin Chinese AISHELL-1 dataset
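
To make the "relative reduction in character error rate" figure concrete, here is a minimal sketch of how such a number is typically computed. The helper names, the example sentences, and the CER values passed to `relative_reduction` are hypothetical illustrations, not results or code from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character edits needed, divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical transcript pair: one substituted character out of six.
cer = character_error_rate("今天天气很好", "今天天汽很好")

# Hypothetical CER values (in percent), chosen only to show the arithmetic.
reduction = relative_reduction(baseline_cer=10.0, new_cer=8.0)
```

A relative reduction is measured against the baseline's own error, so it is not the same as the absolute difference in percentage points between the two systems.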

Abstract

Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes parallelization difficult. In this paper, we propose a self-attention transducer (SA-T) for speech recognition. RNNs are replaced with self-attention blocks, which model long-term dependencies within sequences well and can be parallelized efficiently. Furthermore, a path-aware regularization is proposed to help SA-T learn alignments and improve performance. Additionally, a chunk-flow mechanism is used to achieve online decoding. All experiments are conducted on the Mandarin Chinese dataset AISHELL-1. The results demonstrate that our proposed approach achieves a 21.3% relative reduction in character error rate compared with the baseline RNN-T. In addition, the SA-T with chunk-flow mechanism can perform…
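
The abstract does not spell out how the chunk-flow mechanism works. One common way to make self-attention usable for online decoding is to let each frame attend only to a fixed window of neighbouring frames. The sketch below illustrates that general idea under this assumption; the function names and the `left`/`right` context sizes are hypothetical, and it should not be read as the authors' exact formulation.

```python
import math


def local_attention_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Each frame sees at most `left` past frames and `right` future frames,
    so the look-ahead needed per frame is bounded by `right`.
    """
    return [[i - left <= j <= i + right for j in range(num_frames)]
            for i in range(num_frames)]


def masked_softmax(scores: list[float], mask_row: list[bool]) -> list[float]:
    """Softmax over the visible positions only; masked positions get weight 0."""
    peak = max(s for s, visible in zip(scores, mask_row) if visible)
    exps = [math.exp(s - peak) if visible else 0.0
            for s, visible in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Five frames, each seeing two frames of history and one frame of look-ahead.
mask = local_attention_mask(num_frames=5, left=2, right=1)

# With uniform scores, frame 0 splits its attention between frames 0 and 1.
weights = masked_softmax([0.0] * 5, mask[0])
```

Because every row of the mask has a bounded look-ahead, attention weights for a frame can be computed as soon as the next `right` frames arrive, rather than after the whole utterance.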
