An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

Huaibo Zhao; Yosuke Higuchi; Tetsuji Ogawa; Tetsunori Kobayashi

arXiv:2110.10402 · cs.SD · October 22, 2021

TL;DR

This paper proposes a novel streaming ASR system combining Mask-CTC and triggered attention, achieving high accuracy with low latency by leveraging future context learning during encoder pre-training.

Contribution

It introduces the integration of Mask-CTC with triggered attention for improved streaming ASR performance and low latency.

Findings

01 Higher accuracy than conventional systems.

02 Lower latency achieved in experiments.

03 Effective use of future context learning.

Abstract

In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered attention mechanism, which performs autoregressive decoding triggered by the CTC spike, has been shown to be effective in streaming ASR. However, to maintain highly accurate alignment estimation based on CTC outputs, which is the key to its performance, decoding must inevitably be performed with some future information as input (i.e., with higher latency). In streaming ASR, it is desirable to achieve high recognition accuracy while keeping the latency low. Therefore, the present study aims to achieve highly accurate streaming ASR with low latency by introducing Mask-CTC, which is capable of learning…
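To illustrate what "decoding triggered by the CTC spike" can mean in practice, here is a minimal Python sketch. It is not the authors' implementation: the function names `ctc_trigger_frames` and `visible_frames`, the blank index of 0, the greedy (argmax) spike rule, and the toy posteriors are all assumptions made for illustration. The sketch marks a spike at the first frame where the most likely label is a new non-blank label, and then limits how many encoder frames the decoder may see for that token to the spike frame plus a fixed look-ahead, which is where the latency trade-off mentioned in the abstract comes from.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_trigger_frames(
    frame_posteriors: Sequence[Sequence[float]], blank: int = BLANK
) -> List[Tuple[int, int]]:
    """Return (frame_index, label) for each CTC spike under greedy decoding.

    A spike is the first frame whose argmax label is non-blank and differs
    from the argmax label of the previous frame.
    """
    triggers = []
    prev = blank
    for t, probs in enumerate(frame_posteriors):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != blank and label != prev:
            triggers.append((t, label))
        prev = label
    return triggers


def visible_frames(trigger_frame: int, lookahead: int, num_frames: int) -> int:
    """Number of encoder frames the decoder may attend to for one token."""
    return min(trigger_frame + lookahead + 1, num_frames)


# Toy posteriors: 6 frames over the labels {blank, 1, 2}.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]

triggers = ctc_trigger_frames(posteriors)
windows = [visible_frames(t, lookahead=2, num_frames=len(posteriors)) for t, _ in triggers]
print(triggers)  # [(1, 1), (4, 2)]
print(windows)   # [4, 6]
```

A larger `lookahead` gives the decoder more future context per token at the cost of waiting for more frames, while a smaller one lowers latency but makes the result depend more heavily on how well the CTC spikes are aligned.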

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing