On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

TL;DR
This paper examines how heterogeneity across public training datasets affects speech-to-text foundation models. It shows that proxy-task data filtering, combined with punctuation and true-casing restored by an open LLM, improves performance while using 15% less training data.
Contribution
It presents OWSM v3.2, which addresses data heterogeneity with two strategies: data filtering with a proxy task, and restoring punctuation and true-casing using an open large language model (LLM).
Findings
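OWSM v3.2 improves performance over the OWSM v3.1 baseline, with all other configurations unchanged.
It uses 15% less training data than OWSM v3.1.
Data quality is improved through proxy-task filtering, and punctuation and true-casing are added using an open LLM.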
Abstract
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations staying the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
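The abstract does not spell out how the proxy task scores data, so the following is only a minimal sketch of threshold-based filtering under stated assumptions. It assumes each utterance already carries a hypothetical per-utterance error score produced by some proxy model; the names `Utterance`, `proxy_error`, `filter_by_proxy_score`, and the threshold value are all illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    dataset: str        # name of the source dataset (illustrative)
    text: str           # reference transcript
    proxy_error: float  # ASSUMPTION: error score from a proxy model, lower is better


def filter_by_proxy_score(utterances: List[Utterance], max_error: float) -> List[Utterance]:
    """Keep only utterances whose assumed proxy error is at or below max_error."""
    return [u for u in utterances if u.proxy_error <= max_error]


def removed_fraction(original: List[Utterance], kept: List[Utterance]) -> float:
    """Fraction of the original utterances that the filter dropped."""
    if not original:
        return 0.0
    return 1.0 - len(kept) / len(original)


# Toy data, invented purely for illustration.
samples = [
    Utterance("dataset_a", "hello world", 0.05),
    Utterance("dataset_a", "good morning", 0.20),
    Utterance("dataset_b", "misaligned transcript", 0.90),
    Utterance("dataset_b", "thank you", 0.35),
]

# ASSUMPTION: 0.5 is an arbitrary threshold chosen for this example.
kept = filter_by_proxy_score(samples, max_error=0.5)
print(f"kept {len(kept)} of {len(samples)} utterances")
print(f"removed fraction: {removed_fraction(samples, kept):.2f}")
```

In practice the threshold would probably be tuned per dataset, since the sources differ in quality; a single global cutoff is used here only to keep the sketch short.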
Code & Models
- 🤗 espnet/owsm_v3 · model · 3 dl · ♡ 29
- 🤗 espnet/owsm_v3.1_ebf · model · 34 dl · ♡ 17
- 🤗 espnet/owsm_v3.1_ebf_base · model · 31 dl · ♡ 3
- 🤗 espnet/owsm_ctc_v3.1_1B · model · 31 dl · ♡ 14
- 🤗 espnet/owsm_v3.1_ebf_small · model · 3 dl · ♡ 2
- 🤗 espnet/owsm_v3.1_ebf_small_lowrestriction · model · 4 dl · ♡ 2
- 🤗 espnet/owsm_ctc_v3.2_ft_1B · model · 18 dl · ♡ 5
- 🤗 espnet/owsm_ctc_v4_1B · model · 12k dl · ♡ 7
- 🤗 espnet/owsm_v4_base_102M · model · 18 dl · ♡ 1
- 🤗 espnet/owsm_v4_small_370M · model · 8 dl · ♡ 4
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
