Toward Streaming ASR with Non-Autoregressive Insertion-based Model
Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

TL;DR
This paper introduces a unified neural network architecture combining audio segmentation and non-autoregressive insertion-based ASR to achieve high accuracy with low real-time factor in streaming speech recognition.
Contribution
It proposes a novel integrated model that performs audio segmentation and non-autoregressive ASR simultaneously, improving efficiency and accuracy over traditional separate models.
Findings
Achieved a better trade-off between accuracy and real-time factor (RTF) than autoregressive baselines.
Demonstrated effectiveness on Japanese and English datasets.
Unified model reduces latency and complexity in streaming ASR systems.
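The RTF figure in the findings above follows the usual definition: processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. A minimal sketch, assuming that standard definition; the function name and the example timings are illustrative and not taken from the paper:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 2.5 s of compute for a 10 s audio segment.
rtf = real_time_factor(2.5, 10.0)
print(f"RTF = {rtf:.2f}")
```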
Abstract
Neural end-to-end (E2E) models have become a promising technique to realize practical automatic speech recognition (ASR) systems. When realizing such a system, one important issue is the segmentation of audio to deal with streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because the latency of the system can be lower. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an N-length token sequence in fewer than N iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to realize high-accuracy, low-RTF ASR. As the non-autoregressive ASR, an insertion-based model is used. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that realizes audio segmentation and…
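To illustrate why an insertion-based decoder can need fewer iterations than the number of tokens, here is a toy sketch. It assumes an oracle in place of the neural model and a balanced-tree insertion order (every open gap receives the middle token of its missing span in each pass). Both are assumptions made for illustration; the abstract does not state which insertion order the paper uses, and the function name is hypothetical.

```python
def count_insertion_iterations(target):
    """Simulate parallel insertion with an oracle; return (iterations, hypothesis)."""
    # Each span is a half-open index range [lo, hi) of target tokens not yet emitted.
    spans = [(0, len(target))] if target else []
    emitted = {}
    iterations = 0
    while spans:
        iterations += 1
        next_spans = []
        # One iteration inserts one token into every open gap in parallel.
        for lo, hi in spans:
            mid = (lo + hi) // 2
            emitted[mid] = target[mid]
            if lo < mid:
                next_spans.append((lo, mid))
            if mid + 1 < hi:
                next_spans.append((mid + 1, hi))
        spans = next_spans
    hypothesis = [emitted[i] for i in sorted(emitted)]
    return iterations, hypothesis


tokens = list("abcdefg")  # stand-in for a 7-token transcript
iterations, hypothesis = count_insertion_iterations(tokens)
print(iterations, "iterations for", len(tokens), "tokens")
```

Under these assumptions the number of passes grows roughly with the logarithm of the sequence length, whereas a left-to-right autoregressive decoder needs one step per token.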
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization · Adam
