4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Yui Sudo; Muhammad Shakeel; Brian Yan; Jiatong Shi; Shinji Watanabe

arXiv:2212.10818 · cs.SD · May 31, 2023

Open Access

TL;DR

This paper introduces a unified end-to-end speech recognition model with four jointly trained decoders. The decoders can be switched to suit the application, joint training improves robustness, and the model consistently reduces word error rate (WER) across the evaluated scenarios.

Contribution

The paper proposes a four-decoder joint training framework covering CTC, attention, RNN-T, and mask-predict decoders, which improves the flexibility and performance of ASR systems.
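To make the idea of joint training concrete, here is a minimal sketch that assumes the four decoder losses are combined as a weighted sum on top of a shared encoder. That form, the function name `joint_loss`, and the equal weights are illustrative assumptions; this summary does not state the objective or the weights the authors use.

```python
from typing import Mapping

# Hypothetical equal weights; the summary does not give the actual values.
EXAMPLE_WEIGHTS = {
    "ctc": 0.25,
    "attention": 0.25,
    "rnnt": 0.25,
    "mask_predict": 0.25,
}


def joint_loss(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-decoder losses (assumed form of the joint objective)."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing losses for: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


if __name__ == "__main__":
    # Made-up per-decoder loss values for a single batch.
    batch_losses = {"ctc": 1.0, "attention": 2.0, "rnnt": 3.0, "mask_predict": 4.0}
    print(joint_loss(batch_losses, EXAMPLE_WEIGHTS))
```

In a real system each loss would be a differentiable tensor produced by its decoder, so a single backward pass through the summed value would update the shared encoder and all four decoders together.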

Findings

01. Consistently reduced WER across experiments.

02. Joint training improves model robustness.

03. One-pass joint decoding enhances performance.
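As a rough illustration of joint decoding, the sketch below ranks candidate hypotheses by a weighted sum of log-probabilities from several decoders. The scoring rule, the choice of decoders, the weights, and the names `joint_score` and `pick_best` are all assumptions made for illustration; this summary does not describe the authors' actual search algorithm.

```python
from typing import Dict, List, Mapping

# Hypothetical decoding weights; not taken from the paper.
EXAMPLE_DECODE_WEIGHTS = {"ctc": 0.3, "attention": 0.3, "rnnt": 0.4}


def joint_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine per-decoder log-probabilities into one score (assumed rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def pick_best(hypotheses: List[Dict], weights: Mapping[str, float]) -> Dict:
    """Return the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda hyp: joint_score(hyp["scores"], weights))


if __name__ == "__main__":
    # Made-up candidates with made-up log-probabilities.
    candidates = [
        {"text": "hello world", "scores": {"ctc": -1.0, "attention": -2.0, "rnnt": -1.5}},
        {"text": "hello word", "scores": {"ctc": -0.5, "attention": -3.0, "rnnt": -2.0}},
    ]
    print(pick_best(candidates, EXAMPLE_DECODE_WEIGHTS)["text"])
```

This version rescores complete hypotheses for simplicity. A one-pass search would instead apply a combined score to partial hypotheses while the beam is being expanded.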

Abstract

The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models depending on the application requirement, resulting in the increased overhead of maintaining all models. Several methods for integrating two of these complementary models to mitigate the overhead issue have been proposed; however, if we integrate more models, we will further benefit from these complementary models and realize broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing