OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Yifan Peng; Yui Sudo; Muhammad Shakeel; Shinji Watanabe

arXiv:2402.12654 · cs.CL · August 28, 2024 · 1 citation

TL;DR

This paper introduces OWSM-CTC, an encoder-only speech foundation model based on Connectionist Temporal Classification (CTC) and trained on 180k hours of public audio data. It performs multilingual speech recognition, speech translation, and language identification, reporting competitive ASR results, a 24% improvement in speech translation, and 3-4 times faster inference.

Contribution

The paper presents a novel encoder-only CTC-based speech model that outperforms previous models in speed, robustness, and multilingual tasks, demonstrating scalability and efficiency.

Findings

01 Achieves competitive ASR results and a 24% improvement in speech translation.

02 Increases inference speed by 3-4 times and long-form ASR speed by 20 times.

03 Trained on 180k hours of multilingual speech data.

Abstract

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR),…
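The non-autoregressive property mentioned in the abstract can be illustrated with greedy CTC decoding. The following is a minimal sketch, not the OWSM-CTC implementation: the function name `ctc_greedy_decode`, the toy score matrix, and the choice of index 0 as the blank token are all assumptions made for illustration. Each frame is labeled independently (no token depends on a previously emitted token), then repeated labels are merged and blanks are dropped.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank token


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding over per-frame scores (one list of scores per frame)."""
    # Step 1: pick the best label for every frame independently.
    # This is the non-autoregressive part: frames do not condition on each other.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Step 2: merge consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Step 3: drop blanks; a blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]


# Toy example: 6 frames, vocabulary of 3 labels (0 = blank).
scores = [
    [0.1, 0.8, 0.1],    # label 1
    [0.2, 0.7, 0.1],    # label 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # label 1 (kept, separated by the blank)
    [0.1, 0.2, 0.7],    # label 2
    [0.1, 0.1, 0.8],    # label 2 (repeat, merged)
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

Because step 1 has no dependency across frames, it can run for all frames in parallel, which is the general reason CTC-style decoding tends to be faster than token-by-token autoregressive decoding.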

Code & Models

Repositories

espnet/espnet (PyTorch, official)

Videos

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification · underline

Taxonomy

Topics: Speech Recognition and Synthesis