Whisfusion: Parallel ASR Decoding via a Diffusion Transformer
Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Nam-Joon Kim, Jangchan Kim, Hyun Gon Ryu, and Hyuk-Jae Lee

TL;DR
Whisfusion introduces a novel non-autoregressive ASR framework that combines a Whisper encoder with a diffusion decoder, enabling parallel processing of audio context and significantly reducing latency for long-form speech recognition.
Contribution
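Whisfusion is the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. A lightweight cross-attention adapter, trained via parameter-efficient fine-tuning, bridges the two, and the resulting non-autoregressive decoder avoids the AR latency bottleneck.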
Findings
Achieves lower WER than Whisper-tiny on LibriSpeech test-clean (8.3% vs 9.7%)
Up to 2.6x faster than the AR baseline on long utterances
Maintains comparable latency on short audio
Abstract
Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding…
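As a rough illustration of what batch-parallel, multi-step decoding can look like, the sketch below runs a mask-and-commit loop over k candidate transcripts at once. It is not the paper's algorithm: the commit schedule, the confidence-sum ranking, and the names `toy_denoise_batch` and `parallel_decode` are all assumptions made for illustration, and the toy denoiser peeks at a reference transcript where a real decoder would attend to Whisper encoder features.

```python
import math
import random

MASK = "<mask>"

def toy_denoise_batch(batch, reference, rng):
    """Hypothetical stand-in for one batched decoder forward pass.

    Returns a (token, confidence) guess for every position of every
    candidate. A real decoder would condition on encoder features; here
    we peek at `reference` and add noise to mimic imperfect predictions.
    """
    vocab = sorted(set(reference))
    out = []
    for tokens in batch:
        preds = []
        for pos, tok in enumerate(tokens):
            if tok != MASK:
                preds.append((tok, 1.0))  # already committed, keep as-is
                continue
            conf = rng.random()
            guess = reference[pos] if conf > 0.3 else rng.choice(vocab)
            preds.append((guess, conf))
        out.append(preds)
    return out

def parallel_decode(reference, k=4, steps=4, seed=0):
    """Decode k candidates in lockstep; return (best candidate, passes)."""
    rng = random.Random(seed)
    length = len(reference)
    batch = [[MASK] * length for _ in range(k)]  # start fully masked
    scores = [0.0] * k
    passes = 0
    for step in range(1, steps + 1):
        # One batched pass predicts every position of every candidate.
        preds = toy_denoise_batch(batch, reference, rng)
        passes += 1
        # Assumed linear schedule: positions committed after this step.
        target_filled = math.ceil(length * step / steps)
        for c in range(k):
            masked = [i for i, t in enumerate(batch[c]) if t == MASK]
            n_commit = target_filled - (length - len(masked))
            # Commit the most confident masked positions; others stay masked.
            ranked = sorted(masked, key=lambda i: preds[c][i][1], reverse=True)
            for i in ranked[:n_commit]:
                batch[c][i] = preds[c][i][0]
                scores[c] += preds[c][i][1]
    # Rank candidates by summed confidence (an assumed scoring rule).
    winner = max(range(k), key=lambda c: scores[c])
    return batch[winner], passes

reference = "the quick brown fox jumps over the lazy dog".split()
hypothesis, passes = parallel_decode(reference, k=4, steps=3, seed=0)
print(" ".join(hypothesis), "| batched passes:", passes)
```

In this toy setup the number of batched passes equals `steps` regardless of transcript length or of k, which is the property that makes a larger candidate count cheap when the batch fits on the accelerator.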
Peer Reviews
Decision·Submitted to ICLR 2026
- NAR Framework: The first architecture to fuse a Whisper encoder with a text diffusion decoder for ASR.
- Parallel Diffusion Decoding (PDD): A batch-parallel, multi-step decoding strategy that improves accuracy by increasing candidate count ($k$) with negligible latency impact.
- Speed-Accuracy Operating Point: Achieves lower WER than Whisper-tiny (8.3% vs 9.7% on LibriSpeech test-clean) while being significantly faster (up to 2.6x) on long-form audio.
- Inadequate Baselines for Speed Claims: The paper positions the model as a high-speed alternative to autoregressive (AR) decoding but only compares it against AR models (Whisper variants). It fails to compare against established non-autoregressive or limited-context architectures (e.g., CTC-based models, Transducers with greedy decoding) that are inherently fast. Without these baselines, the claim of a "superior speed-accuracy trade-off" is unsubstantiated, as a standard 300M-parameter CTC mode
* Using diffusion for ASR is an interesting and promising direction.
* Good speedups for recognition.
* Study on the parallel diffusion decoding (PDD) is interesting and shows its importance.
* The experimental setting is bad. We cannot really learn much about the most relevant questions (how does such a diffusion model compare to other alternative ASR models, under the same conditions, with the same training data that was used implicitly or explicitly).
* Phrasing it as novel is not totally correct.
* It is sold as a good solution specifically for long-form ASR, but the experiments show that it performs really badly there.
* No real scaling laws analysis.
* Analysis should be exten
Novel architecture: Using a diffusion transformer as a decoder for ASR is highly original and demonstrates an interesting fusion of generative modeling and speech recognition.
Effective integration: The cross-attention adapter provides a clear and efficient way to connect the Whisper encoder and the diffusion decoder.
Improved decoding strategy: The proposed multi-step batch decoding (similar to beam search) is a well-motivated idea that improves the quality-speed trade-off compared to previou
Limited performance improvement: Although Whisfusion uses the Whisper-small encoder, its performance still lags significantly behind the Whisper-small model, raising questions about the effectiveness of the diffusion-based decoding in capturing linguistic dependencies.
Lack of text length modeling: The model does not explicitly predict or control text length, which could lead to substantial computational waste, especially for short utterances.
Limited comparisons: The paper does not compare wi
Taxonomy
Topics: VLSI and Analog Circuit Testing · Blind Source Separation Techniques
