Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding

Haoran Zhou; Xingchen Song; Brendan Fahy; Qiaochu Song; Binbin Zhang; Zhendong Peng; Anshul Wadhawan; Denglin Jiang; Apurv Verma; Vinay Ramesh; Srivas Prasad; Michele M. Franceschini

arXiv:2506.12154 · cs.SD · June 17, 2025

TL;DR

This paper adapts the Whisper speech recognition model to streaming ASR with two-pass decoding: an added CTC decoder emits streaming partial transcripts, and the original Whisper decoder reranks them.

Contribution

It fine-tunes Whisper for streaming ASR using the WeNet toolkit and a Unified Two-pass (U2) structure, and introduces a hybrid tokenizer that gives the CTC decoder a smaller token space.

Findings

01. Effective streaming ASR is achieved on LibriSpeech and an earnings call dataset.

02. The hybrid tokenizer improves data efficiency and generalization.

03. Whisper can be adapted for streaming with adequate fine-tuning data.

Abstract

OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Connectionist Temporal Classification (CTC) decoder trained with causal attention masks to generate streaming partial transcripts, while the original Whisper decoder reranks these partial outputs. Our experiments on LibriSpeech and an earnings call dataset demonstrate that, with adequate fine-tuning data, Whisper can be adapted into a capable streaming ASR model. We also introduce a hybrid tokenizer approach, which uses a smaller token space for the CTC decoder while retaining Whisper's original…
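The two-pass flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' WeNet implementation. The function names, the blank index `0`, the `ctc_weight` value, and the weighted-sum scoring rule are all assumptions chosen for illustration. The first pass collapses per-frame CTC labels into a partial transcript, the causal mask shows what "no look-ahead" means for the encoder, and the second pass picks the candidate with the best combined score.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical)

# (token ids, CTC log-probability, decoder log-probability)
Candidate = Tuple[List[int], float, float]


def causal_mask(num_frames: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """First pass: merge repeated frame labels, then drop blanks."""
    tokens: List[int] = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank:
            tokens.append(label)
        previous = label
    return tokens


def rerank(candidates: Sequence[Candidate], ctc_weight: float = 0.3) -> Candidate:
    """Second pass: pick the candidate with the best combined score.

    The weighted sum below is an assumed scoring rule, not one taken
    from the paper.
    """
    return max(candidates, key=lambda c: c[2] + ctc_weight * c[1])


# Per-frame argmax labels from a hypothetical CTC head.
partial = ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0])

# Two hypothetical n-best candidates with made-up scores.
best = rerank([([3, 3, 5], -1.0, -4.0), ([3, 5], -2.0, -2.5)])
```

In this toy run the blank between the two runs of `3` keeps them as separate tokens, and the second candidate wins the rerank because its decoder score outweighs its weaker CTC score.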

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques