OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

TL;DR
This paper introduces OWSM-CTC, an encoder-only speech model trained on 180k hours of data, achieving fast, robust, and accurate multilingual speech recognition, translation, and language identification, with significant improvements over prior models.
Contribution
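The paper presents a novel encoder-only, CTC-based speech foundation model that is faster and more robust than autoregressive encoder-decoder models while remaining competitive on multilingual ASR, showing that non-autoregressive models can scale to speech-to-text generation across diverse languages and tasks.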
Findings
Achieves competitive ASR results and up to 24% relative improvement on speech translation.
Runs inference 3 to 4 times faster and is more robust; on long-form ASR it improves results with a 20x speed-up.
Trained on 180k hours of public multilingual speech data.
Abstract
There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR),…
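To make the non-autoregressive point concrete, here is a minimal sketch of greedy CTC decoding. It is not the authors' implementation: the function name, the blank index of 0, and the toy per-frame scores are all assumptions chosen for illustration. The sketch only shows why CTC output needs no token-by-token loop: every frame is labeled independently, then repeats are merged and blanks removed.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame best labels into an output label sequence.

    frame_scores: list of per-frame score lists, one score per label.
    """
    # Pick the best label for each frame independently; no frame's choice
    # depends on previously emitted tokens, unlike autoregressive decoding.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    # Merge consecutive repeats, then drop blanks.
    return [label for label, _ in groupby(best) if label != blank]


# Toy example: 7 frames, 3 labels (0 = blank).
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
]
decoded = ctc_greedy_decode(toy_scores)
```

In the toy input the per-frame best labels are 1, 1, blank, 1, 2, 2, blank. The blank between the two runs of label 1 is what lets the same label appear twice in a row in the output.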
Taxonomy
Topics: Speech Recognition and Synthesis
