Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li

TL;DR
Simul-Whisper enables chunk-based streaming speech recognition with the pre-trained Whisper model, without fine-tuning, by combining attention-guided decoding with a truncation detection method. Across multiple languages and Whisper architectures, the average absolute WER degradation is only 1.46% at a chunk size of 1 second.
Contribution
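This work introduces a streaming ASR method that uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding, together with an integrate-and-fire-based truncation detection model that handles words cut off at chunk boundaries. The pre-trained Whisper model is used without any fine-tuning.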
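As a rough illustration of the attention-guided idea, the sketch below stops emitting tokens once a token's cross-attention peak falls too close to the end of the audio received so far. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `margin` parameter, the head-averaged attention matrix, and the toy values are all hypothetical.

```python
from typing import List, Sequence


def most_attended_frame(attn_row: Sequence[float]) -> int:
    """Index of the audio frame that receives the largest attention weight."""
    return max(range(len(attn_row)), key=lambda i: attn_row[i])


def tokens_to_emit(cross_attn: List[Sequence[float]], margin: int) -> int:
    """Count the leading tokens whose attention peak lies safely inside the chunk.

    cross_attn[t][f] is an assumed, already head-averaged attention weight of
    decoded token t over audio frame f. Emission stops at the first token whose
    peak falls within `margin` frames of the end of the available audio, on the
    assumption that such a token may depend on audio not yet received.
    """
    emitted = 0
    for row in cross_attn:
        n_frames = len(row)
        if most_attended_frame(row) >= n_frames - margin:
            break  # token attends to the chunk boundary: wait for more audio
        emitted += 1
    return emitted


# Toy attention matrix (made-up values): 3 tokens over 6 frames.
attn = [
    [0.70, 0.20, 0.05, 0.03, 0.01, 0.01],  # peak at frame 0
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],  # peak at frame 2
    [0.00, 0.00, 0.10, 0.10, 0.20, 0.60],  # peak at frame 5 (chunk end)
]
print(tokens_to_emit(attn, margin=2))
```

In this toy case the third token is held back until the next chunk arrives, because its peak sits on the last frame.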
Findings
Achieves an average absolute WER degradation of only 1.46% at a chunk size of 1 second
Outperforms existing streaming ASR baselines
Works across multiple languages and Whisper architectures
Abstract
As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which…
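The integrate-and-fire mechanism mentioned in the abstract can be sketched generically: per-frame weights are accumulated, and a boundary "fires" each time the running sum crosses a threshold. The code below is a hedged sketch of that general mechanism only. How Simul-Whisper obtains the weights and how it decides that a chunk ends in a truncated word are not given in this summary, so the residual-based rule in `looks_truncated`, the `residual_cutoff` parameter, and the toy weights are assumptions made for illustration.

```python
from typing import List, Tuple


def integrate_and_fire(weights: List[float],
                       threshold: float = 1.0) -> Tuple[List[int], float]:
    """Accumulate per-frame weights and fire whenever the sum reaches the threshold.

    Returns the frame indices where a fire occurred and the residual
    accumulation left over at the end of the chunk.
    """
    fire_frames: List[int] = []
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= threshold:
            fire_frames.append(i)
            acc -= threshold  # carry the overshoot into the next unit
    return fire_frames, acc


def looks_truncated(weights: List[float], residual_cutoff: float = 0.5) -> bool:
    """Assumed rule: a large residual at the chunk end suggests an unfinished word."""
    _, residual = integrate_and_fire(weights)
    return residual >= residual_cutoff


# Toy per-frame weights (made-up values chosen to be exact in binary floating point).
complete_chunk = [0.5, 0.5, 0.5, 0.5]          # fires twice, nothing left over
cut_off_chunk = [0.25, 0.5, 0.25, 0.5, 0.25]   # fires once, then accumulates 0.75

print(integrate_and_fire(complete_chunk))
print(integrate_and_fire(cut_off_chunk))
print(looks_truncated(complete_chunk), looks_truncated(cut_off_chunk))
```

Under this assumed rule, a chunk flagged as truncated would have its final word withheld and re-decoded once the next chunk arrives.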
Taxonomy
Topics: Peer-to-Peer Network Technologies · Network Security and Intrusion Detection · Advanced Bandit Algorithms Research
