TL;DR
This paper introduces a multi-stream self-attention neural network architecture with dilated 1D convolutions for speech recognition, achieving state-of-the-art results on LibriSpeech.
Contribution
It proposes a novel multi-stream self-attention model with dilated convolutions to better handle correlated speech frames, improving speech recognition accuracy.
Findings
Achieves a 2.2% word error rate (WER) on the LibriSpeech test-clean set.
Reports state-of-the-art results on LibriSpeech, outperforming previously published models.
Demonstrates the effectiveness of multi-resolution attention for speech recognition.
Abstract
Self-attention has been a huge success for many downstream tasks in NLP, which has led to exploration of applying self-attention to speech problems as well. The efficacy of self-attention in speech applications, however, does not yet seem to be fully realized, since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model architecture, namely multi-stream self-attention, to address this issue and thus make the self-attention mechanism more effective for speech recognition. The proposed model architecture consists of parallel streams of self-attention encoders, and each stream has layers of 1D convolutions with dilated kernels whose dilation rates are unique to that stream, followed by a self-attention layer. The self-attention mechanism in each stream pays attention to only one resolution of input speech frames and the…
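The stream structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, layer counts, dimensions, ReLU activation, zero padding, and random weights are all assumptions made for illustration. Because the abstract is truncated before it explains how the streams are combined, the sketch returns one output per stream and leaves the merge step out.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Dilated 1D convolution over time with zero 'same' padding.

    x: (T, d_in) frames, kernel: (k, d_in, d_out) with odd k.
    Output frame t sees input frames t - dilation, t, t + dilation (for k = 3).
    """
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, kernel.shape[2]))
    for j in range(k):
        # Tap j reads the input shifted by j * dilation frames.
        out += xp[j * dilation : j * dilation + T] @ kernel[j]
    return out


def self_attention(h, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def multi_stream_encode(x, dilations, d_model=8, kernel_size=3,
                        n_conv_layers=2, seed=0):
    """Run one encoder stream per dilation rate (all sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    outputs = []
    for dilation in dilations:
        h = x
        # Each stream stacks dilated convolutions sharing that stream's rate.
        for _ in range(n_conv_layers):
            kernel = 0.1 * rng.standard_normal((kernel_size, h.shape[1], d_model))
            h = np.maximum(dilated_conv1d(h, kernel, dilation), 0.0)  # ReLU (assumed)
        wq, wk, wv = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(3))
        # Self-attention then operates on this stream's single resolution.
        outputs.append(self_attention(h, wq, wk, wv))
    return outputs  # one (T, d_model) array per stream; merging is not shown


frames = np.random.default_rng(1).standard_normal((20, 4))  # 20 frames, 4 features
streams = multi_stream_encode(frames, dilations=(1, 2, 3))
print([s.shape for s in streams])
```

Giving each stream its own dilation rate means that a kernel of the same size spans a different stretch of time in each stream, which is one way to read the abstract's phrase "only one resolution of input speech frames".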
