Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation
Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

TL;DR
This paper introduces a hierarchical transformer-based large-context end-to-end speech recognition model that leverages knowledge distillation from a pre-trained language model to effectively utilize long-range context in discourse ASR tasks.
Contribution
It proposes a novel hierarchical transformer architecture for large-context ASR and a knowledge distillation training method from a pre-trained language model.
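As a rough illustration of the distillation idea named above, the sketch below mixes the usual cross-entropy against the reference token with a cross-entropy against a teacher's soft token distribution. This is the generic token-level distillation recipe, not necessarily the paper's exact objective; the function names, the interpolation weight `alpha`, and the toy numbers are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs, target_index, alpha=0.5):
    """Hypothetical per-token loss: (1 - alpha) * hard CE + alpha * soft CE.

    student_logits: ASR model scores over the vocabulary for one token.
    teacher_probs:  teacher language model distribution over the same vocabulary.
    target_index:   index of the reference (ground-truth) token.
    alpha:          assumed interpolation weight between the two terms.
    """
    student_probs = softmax(student_logits)
    # Hard-label term: negative log-likelihood of the reference token.
    hard = -math.log(student_probs[target_index])
    # Soft-label term: cross-entropy between teacher and student distributions.
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return (1.0 - alpha) * hard + alpha * soft

# Toy example with a three-token vocabulary.
student = [2.0, 0.5, -1.0]
teacher = softmax([1.5, 1.0, -2.0])
loss = distillation_loss(student, teacher, target_index=0, alpha=0.5)
```

With `alpha=0.0` this reduces to ordinary cross-entropy training; larger values lean more heavily on the teacher's distribution.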
Findings
Improved recognition accuracy on Japanese discourse ASR tasks.
Effective utilization of long-range context beyond utterance boundaries.
Outperformed utterance-level models that transcribe each utterance independently.
Abstract
We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utterance boundaries, well handle a sequence of utterances such as discourses and conversations. However, the transformer architecture, which has recently achieved state-of-the-art ASR performance among utterance-level ASR systems, has not yet been introduced into the large-context ASR systems. We can expect that the transformer architecture can be leveraged for effectively capturing not only input speech contexts but also long-range sequential contexts beyond utterance boundaries. Therefore, this…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Knowledge Distillation
