Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation

Md Akmal Haidar; Chao Xing; Mehdi Rezagholizadeh

arXiv:2103.09903·cs.AI·March 19, 2021

TL;DR

This paper introduces a Transformer-based end-to-end speech recognition model that incorporates a time reduction layer and self-knowledge distillation, leading to improved performance and reduced computational complexity, achieving state-of-the-art results on LibriSpeech.

Contribution

The paper proposes a novel time reduction layer within Transformer encoders and a self-knowledge distillation fine-tuning method for improved ASR performance.

Findings

01

Outperforms existing Transformer-based ASR systems on LibriSpeech.

02

Reduces the computational cost of self-attention by shortening the encoder sequence with a time reduction layer.

03

Achieves state-of-the-art WER with 30 million parameters without external data.

Abstract

End-to-end automatic speech recognition (ASR), unlike conventional ASR, has no dedicated modules for learning semantic representations from the speech encoder. Moreover, the high frame rate of the speech representation prevents the model from learning semantic representations properly. Models built on a speech encoder with a lower frame rate therefore lead to better performance. For Transformer-based ASR, a lower frame rate matters not only for learning better semantic representations but also for reducing computational cost, since the self-attention mechanism has O(n^2) complexity in the sequence length n during both training and inference. In this paper, we propose a Transformer-based ASR model with a time reduction layer, in which we incorporate the time reduction layer inside the Transformer encoder layers in addition to traditional sub-sampling methods to input features that…
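
To make the frame-rate argument concrete, here is a minimal sketch of one common form of time reduction: concatenating adjacent frames so the sequence becomes shorter. The excerpt above does not specify how the paper's layer is built, so the function names (`time_reduce`, `attention_score_count`), the concatenation scheme, and the reduction factor of 2 are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Frame = List[float]


def time_reduce(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate each group of `factor` consecutive frames into one frame.

    Assumption: reduction by frame concatenation. Trailing frames that do
    not fill a complete group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    reduced: List[Frame] = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)  # feature dimension grows by `factor`
        reduced.append(merged)
    return reduced


def attention_score_count(n: int) -> int:
    """Number of pairwise scores self-attention computes for n frames."""
    return n * n


# Toy input: 8 frames with 2 features each (hypothetical values).
frames = [[float(t), float(t) + 0.5] for t in range(8)]
reduced = time_reduce(frames, factor=2)

print(len(frames), "->", len(reduced))
print(attention_score_count(len(frames)), "->", attention_score_count(len(reduced)))
```

Under these assumptions, halving the number of frames cuts the number of pairwise attention scores to a quarter, which is the quadratic saving the abstract refers to.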
