Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

Yosuke Kashiwagi; Hayato Futami; Emiru Tsunoo; Satoshi Asakawa

arXiv:2506.01439·cs.CL·June 3, 2025

TL;DR

Whale is a large-scale multilingual speech recognition model that combines a w2v-BERT self-supervised model with an E-Branchformer encoder-decoder, trained on diverse speech data. It reaches 2.4% WER on Librispeech test-clean and 3.4% CER on CSJ eval3.

Contribution

This work introduces Whale, a novel large-scale speech recognition model integrating w2v-BERT and E-Branchformer, trained on extensive diverse datasets for improved robustness and performance.

Findings

01 Achieves a 2.4% WER on the Librispeech test-clean set

02 Outperforms Whisper large-v3 and OWSM v3.1 on Librispeech test-clean (2.4% WER) and CSJ eval3 (3.4% CER)

03 Demonstrates robustness across diverse speaking styles and acoustic conditions
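
For readers unfamiliar with the metrics quoted in the findings, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, and the paper's own scoring pipeline (text normalization, tokenization) is not described here, so this should not be read as the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

Under this definition, a hypothesis that drops one word from a six-word reference has a WER of 1/6. CER is usually preferred for languages such as Japanese, where word boundaries are not marked by spaces.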

Abstract

This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates a w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data drawn not only from public corpora but also from in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. In evaluations on multiple benchmarks, Whale achieves performance comparable to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
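
The joint CTC-attention decoding mentioned in the abstract is commonly formulated as a weighted sum of the CTC and attention-decoder log-probabilities for each hypothesis. The sketch below illustrates that general idea only. The function names, the hypothesis tuples, and the weight of 0.3 are assumptions for illustration, not values taken from the paper, and practical systems typically combine the two scores inside beam search instead of rescoring complete hypotheses.

```python
def joint_score(ctc_logp, att_logp, ctc_weight=0.3):
    """Interpolate CTC and attention log-probabilities.

    ctc_weight is a hypothetical value; the paper's setting is not given here.
    """
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def rescore(hyps, ctc_weight=0.3):
    """Pick the best hypothesis from (text, ctc_logp, att_logp) tuples."""
    return max(hyps, key=lambda h: joint_score(h[1], h[2], ctc_weight))


# Toy hypotheses with made-up scores, for illustration only.
hyps = [
    ("hypothesis a", -1.0, -5.0),  # favored by CTC
    ("hypothesis b", -3.0, -2.0),  # favored by the attention decoder
]
best_text = rescore(hyps)[0]
```

With a small CTC weight the attention decoder dominates the choice, and raising the weight shifts the decision toward the CTC score. This is why the weight is usually tuned on a development set.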

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: E-Branchformer · Sparse Evolutionary Training