OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary

Yui Sudo; Yusuke Fujita; Atsushi Kojima; Tomoya Mizumoto; Lianbo Liu

arXiv:2506.09448·cs.SD·June 12, 2025

Open Access

TL;DR

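This paper adds contextual biasing to OWSM v3.1, a pre-trained open speech foundation model, to improve recognition of rare and unseen words. With the pre-trained parameters frozen, the method improves the biasing word error rate (B-WER) by 11.6 points and the overall WER by 0.9 points, and the real-time factor decreases by 7.5%.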

Contribution

The paper integrates an existing contextual biasing (CB) method with OWSM v3.1 while freezing its pre-trained parameters. This preserves the knowledge embedded in the model and enables effective biasing toward rare words even with a small dataset.

Findings

01 Biasing word error rate (B-WER) reduced by 11.6 points
02 Overall WER improved by 0.9 points
03 Real-time factor decreased by 7.5%

Abstract

Speech foundation models (SFMs), such as Open Whisper-Style Speech Models (OWSM), are trained on massive datasets to achieve accurate automatic speech recognition. However, even SFMs struggle to accurately recognize rare and unseen words. While contextual biasing (CB) is a promising approach to improve recognition of such words, most CB methods are trained from scratch, resulting in lower performance than SFMs due to the lack of pre-trained knowledge. This paper integrates an existing CB method with OWSM v3.1 while freezing its pre-trained parameters. By leveraging the knowledge embedded in SFMs, the proposed method enables effective CB while preserving the advantages of SFMs, even with a small dataset. Experimental results show that the proposed method improves the biasing word error rate (B-WER) by 11.6 points, resulting in a 0.9 point improvement in the overall WER while reducing the…
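
The two error metrics quoted in the abstract can be illustrated with a minimal Python sketch. It assumes that WER is the word-level edit distance divided by the number of reference words, and that B-WER counts only errors involving words in the biasing list (substituted or deleted reference words from the list, plus inserted words from the list), divided by the number of biasing words in the reference. That is a common convention, but it is not spelled out on this page, so treat it as an assumption. The function names, example sentence, and biasing list are hypothetical, and this is not the authors' evaluation code.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Levenshtein-align two word lists; return (op, ref_word, hyp_word) triples."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer_and_bwer(ref: List[str], hyp: List[str], biasing: Set[str]) -> Tuple[float, float]:
    """Return (WER, B-WER) as fractions under the assumptions stated above."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    ops = align(ref, hyp)
    wer = sum(op != "ok" for op, _, _ in ops) / len(ref)

    biasing_ref_words = sum(word in biasing for word in ref)
    biasing_errors = 0
    for op, ref_word, hyp_word in ops:
        if op in ("sub", "del") and ref_word in biasing:
            biasing_errors += 1
        elif op == "ins" and hyp_word in biasing:
            biasing_errors += 1
    # Assumption: report 0.0 when the reference has no biasing words.
    bwer = biasing_errors / biasing_ref_words if biasing_ref_words else 0.0
    return wer, bwer


if __name__ == "__main__":
    reference = "please call doctor nakamura today".split()
    hypothesis = "please call doctor nakamora today".split()
    print(wer_and_bwer(reference, hypothesis, {"nakamura"}))  # (0.2, 1.0)
```

In this toy example a single misrecognized name costs 20% WER but 100% B-WER, which shows why B-WER can move by many more points than the overall WER when rare words are fixed.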

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders