OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

TL;DR
This paper introduces OWSM v3.1, an improved open-source speech model based on E-Branchformer that outperforms previous versions in accuracy and speed, with emergent zero-shot capabilities and open data licensing.
Contribution
The work presents a new E-Branchformer-based OWSM v3.1 model that enhances performance and efficiency over prior versions without additional data, and demonstrates emergent zero-shot recognition abilities.
Findings
OWSM v3.1 outperforms OWSM v3 on most evaluation benchmarks.
Inference speed improved by up to 25%.
Emergent zero-shot contextual biasing ability in speech recognition observed.
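To make the 25% figure concrete, here is a small worked calculation. It is only a sketch: it assumes "speed" means throughput (work done per unit of wall-clock time), which this summary does not specify, and the timings are made-up illustration values, not measurements from the paper.

```python
def speedup_percent(baseline_seconds: float, new_seconds: float) -> float:
    """Relative throughput gain (in %) when decode time drops from baseline to new."""
    if baseline_seconds <= 0 or new_seconds <= 0:
        raise ValueError("timings must be positive")
    return (baseline_seconds / new_seconds - 1.0) * 100.0


def time_saved_percent(speedup_pct: float) -> float:
    """Wall-clock time reduction (in %) implied by a given throughput gain."""
    return (1.0 - 1.0 / (1.0 + speedup_pct / 100.0)) * 100.0


# Hypothetical timings for decoding the same test set (not from the paper).
baseline = 100.0  # seconds with the older model
faster = 80.0     # seconds with the newer model

gain = speedup_percent(baseline, faster)  # throughput gain, about 25%
saved = time_saved_percent(gain)          # time reduction, about 20%
print(f"throughput gain: {gain:.1f}%, time saved: {saved:.1f}%")
```

Under this reading, a 25% speed improvement corresponds to roughly a 20% shorter decoding time, so the two numbers should not be used interchangeably.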
Abstract
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with…
Code & Models
- 🤗 espnet/owsm_v3 · model · 3 dl · ♡ 29
- 🤗 espnet/owsm_v3.1_ebf · model · 34 dl · ♡ 17
- 🤗 espnet/owsm_v3.1_ebf_base · model · 31 dl · ♡ 3
- 🤗 espnet/owsm_ctc_v3.1_1B · model · 31 dl · ♡ 14
- 🤗 espnet/owsm_v3.1_ebf_small · model · 3 dl · ♡ 2
- 🤗 espnet/owsm_v3.1_ebf_small_lowrestriction · model · 4 dl · ♡ 2
- 🤗 espnet/owsm_ctc_v3.2_ft_1B · model · 18 dl · ♡ 5
- 🤗 espnet/owsm_ctc_v4_1B · model · 12k dl · ♡ 7
- 🤗 espnet/owsm_v4_base_102M · model · 18 dl · ♡ 1
- 🤗 espnet/owsm_v4_small_370M · model · 8 dl · ♡ 4
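As a rough sketch of how one of the v3.1 checkpoints listed above might be loaded, the snippet below wraps ESPnet's `Speech2Text.from_pretrained`. The dictionary keys and the `load_owsm` helper are hypothetical names introduced here for illustration, and the keyword arguments (`beam_size`, `lang_sym`, `task_sym`) follow the usage shown on the model cards as best recalled, so they should be checked against the installed ESPnet version.

```python
# Encoder-decoder OWSM v3.1 checkpoints, copied from the list above.
# The short keys are hypothetical labels, not official names.
OWSM_V31_MODELS = {
    "ebf": "espnet/owsm_v3.1_ebf",
    "ebf_base": "espnet/owsm_v3.1_ebf_base",
    "ebf_small": "espnet/owsm_v3.1_ebf_small",
    "ebf_small_lowrestriction": "espnet/owsm_v3.1_ebf_small_lowrestriction",
}


def load_owsm(name: str = "ebf_base", device: str = "cpu"):
    """Hypothetical helper: load an OWSM v3.1 checkpoint for English ASR."""
    if name not in OWSM_V31_MODELS:
        raise KeyError(f"unknown model key: {name}")
    # Imported lazily so this module can be imported without ESPnet installed.
    from espnet2.bin.s2t_inference import Speech2Text

    # Assumed arguments; verify against the model card before relying on them.
    return Speech2Text.from_pretrained(
        model_tag=OWSM_V31_MODELS[name],
        device=device,
        beam_size=5,
        lang_sym="<eng>",
        task_sym="<asr>",
    )
```

Calling `load_owsm()` downloads the checkpoint and needs ESPnet installed. The CTC checkpoints in the list are left out here because they may require a different inference entry point.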
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax
