Toward Streaming ASR with Non-Autoregressive Insertion-based Model

Yuya Fujita; Tianzi Wang; Shinji Watanabe; Motoi Omachi

arXiv:2012.10128 · eess.AS · July 19, 2021 · Interspeech

TL;DR

This paper introduces a unified neural network architecture that combines audio segmentation with non-autoregressive insertion-based ASR, aiming for high accuracy at a low real-time factor in streaming speech recognition.

Contribution

It proposes a novel integrated model that performs audio segmentation and non-autoregressive ASR simultaneously, improving efficiency and accuracy over traditional separate models.

Findings

1. Achieved a better trade-off between accuracy and RTF compared to autoregressive baselines.

2. Demonstrated effectiveness on Japanese and English datasets.

3. Unified model reduces latency and complexity in streaming ASR systems.

Abstract

Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is segmenting the audio to handle streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach, since it can decode an $N$-length token sequence in fewer than $N$ iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to realize high-accuracy, low-RTF ASR. The insertion-based model is used as the non-autoregressive ASR. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that realizes audio segmentation and…
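The two quantities the abstract leans on, RTF and the number of decoding iterations, can be illustrated with a minimal sketch. RTF is taken here in its usual sense of processing time divided by audio duration. The decoding loop is a toy oracle, not the paper's model: it assumes a balanced binary-tree insertion order (one new token per open gap per iteration), which the abstract does not specify, and the function names `real_time_factor` and `parallel_insertion_decode` are hypothetical.

```python
from typing import List, Sequence, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def parallel_insertion_decode(target: Sequence[str]) -> Tuple[List[str], int]:
    """Toy oracle for insertion-based decoding.

    ASSUMPTION: each iteration inserts the middle token of every still-open
    gap in parallel (balanced binary-tree order). A real model would predict
    the tokens; here they are read from `target` to count iterations only.
    """
    n = len(target)
    filled: List[int] = []  # sorted target positions inserted so far
    iterations = 0
    while len(filled) < n:
        bounds = [-1] + filled + [n]  # sentinels mark the sequence edges
        new_positions = [
            (left + right) // 2
            for left, right in zip(bounds, bounds[1:])
            if right - left > 1  # gap still has at least one missing token
        ]
        filled = sorted(filled + new_positions)
        iterations += 1
    return [target[i] for i in filled], iterations


if __name__ == "__main__":
    tokens, iterations = parallel_insertion_decode(list("abcdefg"))
    print(tokens, iterations)            # 7 tokens recovered in 3 iterations
    print(real_time_factor(2.0, 10.0))   # 0.2
```

Under this assumed order, a 7-token sequence is completed in 3 parallel iterations, whereas left-to-right autoregressive decoding would take one step per token. With similar per-iteration cost, fewer iterations means less processing time and therefore a smaller RTF.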

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization · Adam