Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation
Md Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh

TL;DR
This paper introduces a Transformer-based end-to-end speech recognition model that incorporates a time reduction layer and self-knowledge distillation, leading to improved performance and reduced computational complexity, achieving state-of-the-art results on LibriSpeech.
Contribution
The paper proposes a novel time reduction layer within Transformer encoders and a self-knowledge distillation fine-tuning method for improved ASR performance.
Findings
Outperforms existing Transformer-based ASR systems on LibriSpeech.
Reduces the computational cost of self-attention by adding a time reduction layer.
Achieves state-of-the-art WER with 30 million parameters without external data.
Abstract
End-to-end automatic speech recognition (ASR), unlike conventional ASR, does not have modules to learn the semantic representation from the speech encoder. Moreover, the higher frame rate of the speech representation prevents the model from learning the semantic representation properly. Therefore, models built on a speech encoder with a lower frame rate lead to better performance. For Transformer-based ASR, the lower frame rate is important not only for learning a better semantic representation but also for reducing the computational complexity of the self-attention mechanism, which is O(n^2) in both training and inference. In this paper, we propose a Transformer-based ASR model with a time reduction layer, in which we incorporate the time reduction layer inside the Transformer encoder layers in addition to traditional sub-sampling methods applied to input features that…
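To make the frame-rate argument concrete, here is a minimal sketch of one common form of time reduction: concatenating each group of adjacent frames into a single wider frame. This is an assumption for illustration only. The excerpt above does not say which operation the authors' layer uses, and the names `time_reduce` and `attention_score_count` are hypothetical. The sketch also counts the pairwise scores behind the O(n^2) cost of self-attention.

```python
from typing import List

Frame = List[float]


def time_reduce(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one wider frame.

    Illustrative assumption: frame stacking is one common way to lower the
    frame rate; it is not confirmed to be the paper's exact layer.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    reduced = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        # Zero-pad an incomplete final group so all outputs share one width.
        group = group + [[0.0] * dim] * (factor - len(group))
        # Flatten the group: `factor` frames of size `dim` become one frame
        # of size `factor * dim`.
        reduced.append([x for frame in group for x in frame])
    return reduced


def attention_score_count(seq_len: int) -> int:
    """Pairwise attention scores for one self-attention head: n^2."""
    return seq_len * seq_len


if __name__ == "__main__":
    # Toy input: 7 frames with 2 features each.
    frames = [[float(t), float(t) + 0.5] for t in range(7)]
    reduced = time_reduce(frames, factor=2)
    print(len(frames), "->", len(reduced), "frames")
    print(attention_score_count(len(frames)), "->",
          attention_score_count(len(reduced)), "pairwise scores")
```

Under these assumptions, halving the sequence length cuts the number of attention scores to roughly a quarter, which is the computational saving the abstract attributes to the lower frame rate.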
