OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

TL;DR
This paper introduces OWLS, a large-scale multilingual speech recognition and translation model suite, and derives neural scaling laws to predict performance, highlighting improvements for low-resource languages and emergent capabilities in large models.
Contribution
The paper presents OWLS, a model suite whose 18B version is, to the authors' knowledge, the largest speech model to date, and systematically investigates how data, model size, and compute influence multilingual speech performance, establishing neural scaling laws.
Findings
Scaling improves performance on low-resource languages.
Neural scaling laws can reliably predict speech model performance.
Large models exhibit emergent abilities in speech tasks.
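As a rough illustration of the second finding, the sketch below fits a power law to results from smaller models and extrapolates to a larger one. It is a minimal sketch under stated assumptions, not the authors' procedure: the functional form error = a * N^(-b) is assumed for illustration, the error values are synthetic (generated from made-up constants, not taken from the paper), and the names fit_power_law and predict are hypothetical. Only the model sizes (0.25B to 18B parameters) come from the OWLS suite.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size).

    Returns (a, b) for the assumed form error = a * size ** (-b).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, size):
    """Evaluate the fitted power law at a given model size."""
    return a * size ** (-b)


# SYNTHETIC data: these constants are invented for the example and are
# NOT results from the paper.
TRUE_A, TRUE_B = 50.0, 0.2

# Model sizes in billions of parameters; the sizes match the smaller
# OWLS checkpoints, the error values are generated, not measured.
small_sizes = [0.25, 0.5, 1.0, 2.0, 4.0]
synthetic_errors = [TRUE_A * s ** (-TRUE_B) for s in small_sizes]

# Fit on the small models, then extrapolate to the 18B scale.
a, b = fit_power_law(small_sizes, synthetic_errors)
extrapolated = predict(a, b, 18.0)

print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"extrapolated error at 18B: {extrapolated:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants; real measurements would be noisy, and a fit of this kind would normally be checked against a held-out larger model before being trusted.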
Abstract
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects,…
Code & Models
- 🤗 espnet/owls_9B_180K · model · 5 dl
- 🤗 espnet/owls_4B_180K · model · 3 dl · ♡ 5
- 🤗 espnet/owls_2B_180K · model · 2 dl
- 🤗 espnet/owls_1B_180K · model · 2 dl · ♡ 3
- 🤗 espnet/owls_05B_180K · model · 1 dl
- 🤗 espnet/owls_025B_180K · model · 4 dl
- 🤗 espnet/owls_18B_180K · model · 2 dl · ♡ 1
- 🤗 espnet/owls_18B_360K · model · 2 dl · ♡ 1
Taxonomy
Topics: Speech Recognition and Synthesis
