All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, Atsunori Ogawa

TL;DR
This paper introduces All-in-One ASR, a unified model supporting multiple speech recognition paradigms and modes, reducing complexity and maintaining high performance through a multi-mode joiner and joint decoding.
Contribution
The paper presents a novel unified framework that integrates CTC, attention-based encoder-decoder, and Transducer models into a single multi-mode ASR system, enabling versatile deployment.
Findings
Significantly reduces model footprint compared to separate models.
Achieves comparable or better recognition accuracy across modes.
Joint decoding improves overall recognition performance.
Abstract
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
