Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

TL;DR
This paper reproduces Whisper-style speech model training using an open-source toolkit and publicly available data, enabling broader research and improvements in speech recognition and translation.
Contribution
It presents the Open Whisper-style Speech Model (OWSM), which supports more translation directions than Whisper and can be more efficient to train, and it commits to releasing its resources publicly for transparency and reproducibility.
Findings
OWSM supports more translation directions than Whisper.
OWSM can be trained more efficiently.
All scripts and models will be publicly released.
Abstract
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and…
Code & Models
- 🤗 espnet/owsm_v3 (model) · 3 dl · ♡ 29
- 🤗 espnet/owsm_v3.1_ebf (model) · 34 dl · ♡ 17
- 🤗 espnet/owsm_v3.1_ebf_base (model) · 31 dl · ♡ 3
- 🤗 espnet/owsm_ctc_v3.1_1B (model) · 31 dl · ♡ 14
- 🤗 espnet/owsm_v3.1_ebf_small (model) · 3 dl · ♡ 2
- 🤗 espnet/owsm_v3.1_ebf_small_lowrestriction (model) · 4 dl · ♡ 2
- 🤗 espnet/owsm_ctc_v3.2_ft_1B (model) · 18 dl · ♡ 5
- 🤗 espnet/owsm_ctc_v4_1B (model) · 12k dl · ♡ 7
- 🤗 espnet/owsm_v4_base_102M (model) · 18 dl · ♡ 1
- 🤗 espnet/owsm_v4_small_370M (model) · 8 dl · ♡ 4
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
