Self-Attention Transducers for End-to-End Speech Recognition
Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengqi Wen

TL;DR
This paper introduces a self-attention transducer (SA-T) model for end-to-end speech recognition, replacing RNNs with self-attention blocks for better parallelization and long-term dependency modeling, with improved accuracy and online decoding capabilities.
Contribution
The paper proposes a novel self-attention transducer architecture, with a path-aware regularization that helps the model learn alignments and a chunk-flow mechanism for online decoding, improving on the baseline RNN-T model.
Findings
21.3% relative reduction in character error rate compared with the baseline RNN-T
Effective online decoding with minimal performance loss
All experiments conducted on the Mandarin Chinese dataset AISHELL-1
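To make the headline metric concrete, here is a minimal sketch of how a character error rate (CER) and a relative reduction are usually computed. The function names are hypothetical and this is not the authors' evaluation script; it assumes the standard definition of CER as character-level edit distance divided by reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline
```

For example, relative_reduction(10.0, 7.87) gives about 0.213. Those two rates are made-up values chosen to reproduce a 21.3% relative reduction; they are not numbers reported in the paper.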
Abstract
Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes parallelization difficult. In this paper, we propose a self-attention transducer (SA-T) for speech recognition. RNNs are replaced with self-attention blocks, which model long-term dependencies inside sequences well and can be parallelized efficiently. Furthermore, a path-aware regularization is proposed to help SA-T learn alignments and improve performance. Additionally, a chunk-flow mechanism is used to achieve online decoding. All experiments are conducted on the Mandarin Chinese dataset AISHELL-1. The results demonstrate that our proposed approach achieves a 21.3% relative reduction in character error rate compared with the baseline RNN-T. In addition, the SA-T with chunk-flow mechanism can perform…
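The abstract does not spell out how the chunk-flow mechanism works. One common way to make self-attention usable online is to let each frame attend only to a fixed window of neighbouring frames, and the sketch below assumes something along those lines. The function names and the left/right window sizes are hypothetical, and the attention omits learned projections and multiple heads for brevity.

```python
import math


def chunk_mask(num_frames, left, right):
    """mask[t][s] is True when frame t may attend to frame s.

    Assumption: each frame sees `left` past frames, itself, and
    `right` future frames. The paper's exact scheme may differ.
    """
    return [[t - left <= s <= t + right for s in range(num_frames)]
            for t in range(num_frames)]


def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention over frames x.

    x is a list of equal-length feature vectors; queries, keys and
    values are all x itself (no learned projections in this sketch).
    """
    d = len(x[0])
    out = []
    for t, q in enumerate(x):
        # Score only the frames that the mask makes visible to frame t.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  if mask[t][s] else None
                  for s, k in enumerate(x)]
        peak = max(sc for sc in scores if sc is not None)
        # Softmax over visible frames; masked frames get zero weight.
        weights = [math.exp(sc - peak) if sc is not None else 0.0
                   for sc in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, x))
                    for i in range(d)])
    return out
```

With left=0 and right=0 every frame attends only to itself, so the output equals the input. Widening the window trades latency (more future frames to wait for) against context.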
