Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation

Ryo Masumura; Naoki Makishima; Mana Ihori; Akihiko Takashima; Tomohiro Tanaka; Shota Orihashi

arXiv:2102.07935·cs.CL·February 17, 2021

Open Access

TL;DR

This paper introduces a hierarchical transformer-based large-context end-to-end speech recognition model that leverages knowledge distillation from a pre-trained language model to effectively utilize long-range context in discourse ASR tasks.

Contribution

It proposes a novel hierarchical transformer architecture for large-context ASR and a knowledge distillation training method from a pre-trained language model.

Findings

01. Improved recognition accuracy on Japanese discourse ASR tasks.

02. Effective utilization of long-range context beyond utterance boundaries.

03. Demonstrated superiority over traditional utterance-level models.

Abstract

We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, well handle a sequence of utterances such as discourses and conversations. However, the transformer architecture, which has recently achieved state-of-the-art ASR performance among utterance-level ASR systems, has not yet been introduced into the large-context ASR systems. We can expect that the transformer architecture can be leveraged for effectively capturing not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this…
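
The knowledge-distillation idea mentioned in the abstract can be illustrated with a generic soft-target loss. The sketch below is not taken from the paper: the interpolation form, the weight `alpha`, and the function names `log_softmax` and `distillation_loss` are assumptions chosen for illustration. It shows one common formulation, in which a per-token loss mixes cross-entropy against the reference token with cross-entropy against a teacher model's output distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def distillation_loss(student_logits, teacher_probs, target_index, alpha=0.5):
    """Generic per-token distillation loss (illustrative assumption, not the paper's exact loss).

    (1 - alpha) * cross-entropy against the reference token
    + alpha * cross-entropy against the teacher's soft distribution.
    """
    log_probs = log_softmax(student_logits)
    hard_loss = -log_probs[target_index]
    soft_loss = -sum(t * lp for t, lp in zip(teacher_probs, log_probs))
    return (1.0 - alpha) * hard_loss + alpha * soft_loss


# Hypothetical 4-token vocabulary: student scores and a teacher distribution.
student_logits = [2.0, 0.5, -1.0, 0.0]
teacher_probs = [0.7, 0.2, 0.05, 0.05]
loss = distillation_loss(student_logits, teacher_probs, target_index=0, alpha=0.5)
```

With `alpha=0` this reduces to ordinary cross-entropy training; larger values give the teacher's distribution more influence on the student.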

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Knowledge Distillation