Whisfusion: Parallel ASR Decoding via a Diffusion Transformer

Taeyoun Kwon; Junhyuk Ahn; Taegeun Yun; Heeju Jwa; Yoonchae Choi; Siwon Park; Nam-Joon Kim; Jangchan Kim; Hyun Gon Ryu; and Hyuk-Jae Lee

arXiv:2508.07048·cs.SD·August 12, 2025



TL;DR

Whisfusion introduces a novel non-autoregressive ASR framework that combines a Whisper encoder with a diffusion decoder, enabling parallel processing of audio context and significantly reducing latency for long-form speech recognition.

Contribution

Whisfusion is the first framework to fuse a pre-trained Whisper encoder with a diffusion-based text decoder. It removes the AR latency bottleneck, and a parameter-efficient fine-tuning approach connects the two components.

Findings

01

Achieves lower WER than Whisper-tiny on LibriSpeech

02

Up to 2.6x faster than AR baseline on long utterances

03

Maintains comparable latency on short audio
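The first finding is stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition. The function name `word_error_rate` is ours, and real evaluations usually normalize text (casing, punctuation) first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value is better, and because insertions count as errors the value can exceed 1.0.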

Abstract

Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding…
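To make the idea of multi-step, batch-parallel decoding concrete, the following is a toy sketch of iterative mask-predict decoding over `k` candidates. It is not the authors' implementation: the abstract is truncated before it describes the strategy, so the linear unmasking schedule, the confidence-sum candidate score, and every name here (`MASK`, `make_toy_predictor`, `parallel_decode`) are assumptions made for illustration. The predictor is a hypothetical stand-in that peeks at a known target instead of attending to Whisper encoder features.

```python
import random

MASK = "<mask>"


def make_toy_predictor(target, accuracy=0.8):
    """Hypothetical stand-in for a diffusion decoder conditioned on audio.

    Returns a function that proposes (token, confidence) for every position.
    """
    vocab = sorted(set(target))

    def predict(tokens, rng):
        proposals = []
        for i, tok in enumerate(tokens):
            if tok != MASK:
                proposals.append((tok, 1.0))  # already committed
            elif rng.random() < accuracy:
                proposals.append((target[i], rng.uniform(0.6, 1.0)))
            else:
                proposals.append((rng.choice(vocab), rng.uniform(0.0, 0.6)))
        return proposals

    return predict


def parallel_decode(predict, length, steps, k, seed=0):
    """Fill `length` masked slots in `steps` passes for `k` candidates.

    Every position is predicted in each pass; only the most confident
    masked positions are committed, so the pass count is fixed by `steps`
    rather than by the sequence length.
    """
    if steps < 1 or k < 1:
        raise ValueError("steps and k must be at least 1")
    rng = random.Random(seed)
    batch = [[MASK] * length for _ in range(k)]
    scores = [0.0] * k

    for step in range(steps):
        # Assumed linear schedule: positions left masked after this pass.
        keep_masked = length * (steps - step - 1) // steps
        # A real system would run these k candidates as one batched forward pass.
        for c in range(k):
            proposals = predict(batch[c], rng)
            masked = [i for i, tok in enumerate(batch[c]) if tok == MASK]
            masked.sort(key=lambda i: proposals[i][1], reverse=True)
            for i in masked[: len(masked) - keep_masked]:
                batch[c][i] = proposals[i][0]
                scores[c] += proposals[i][1]

    best = max(range(k), key=lambda c: scores[c])
    return batch[best], scores[best]


if __name__ == "__main__":
    target = "the cat sat on the mat".split()
    tokens, score = parallel_decode(make_toy_predictor(target), len(target), steps=3, k=4)
    print(" ".join(tokens), round(score, 3))
```

The point of the sketch is the cost structure: the number of decoder passes depends on `steps`, not on transcript length, and extra candidates add batch width rather than sequential passes, provided the hardware can absorb the larger batch.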

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 5

Strengths

- NAR Framework: The first architecture to fuse a Whisper encoder with a text diffusion decoder for ASR.
- Parallel Diffusion Decoding (PDD): A batch-parallel, multi-step decoding strategy that improves accuracy by increasing the candidate count ($k$) with negligible latency impact.
- Speed-Accuracy Operating Point: Achieves lower WER than Whisper-tiny (8.3% vs 9.7% on LibriSpeech test-clean) while being significantly faster (up to 2.6x) on long-form audio.

Weaknesses

- Inadequate Baselines for Speed Claims: The paper positions the model as a high-speed alternative to autoregressive (AR) decoding but only compares it against AR models (Whisper variants). It fails to compare against established non-autoregressive or limited-context architectures (e.g., CTC-based models, Transducers with greedy decoding) that are inherently fast. Without these baselines, the claim of a "superior speed-accuracy trade-off" is unsubstantiated, as a standard 300M-parameter CTC mode

Reviewer 02 · Rating 2 · Confidence 5

Strengths

* Using diffusion for ASR is an interesting and promising direction.
* Good speedups for recognition.
* Study on the parallel diffusion decoding (PDD) is interesting and shows its importance.

Weaknesses

* The experimental setting is bad. We cannot really learn much about the most relevant questions (how does such a diffusion model compare to alternative ASR models under the same conditions, with the same training data that was used implicitly or explicitly).
* Phrasing it as novel is not totally correct.
* It is sold as a good solution specifically for long-form ASR, but the experiments show that it performs really badly there.
* No real scaling laws analysis.
* Analysis should be exten

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Novel architecture: Using a diffusion transformer as a decoder for ASR is highly original and demonstrates an interesting fusion of generative modeling and speech recognition.

Effective integration: The cross-attention adapter provides a clear and efficient way to connect the Whisper encoder and the diffusion decoder.

Improved decoding strategy: The proposed multi-step batch decoding (similar to beam search) is a well-motivated idea that improves the quality-speed trade-off compared to previou

Weaknesses

Limited performance improvement: Although Whisfusion uses the Whisper-small encoder, its performance still lags significantly behind the Whisper-small model, raising questions about the effectiveness of the diffusion-based decoding in capturing linguistic dependencies.

Lack of text length modeling: The model does not explicitly predict or control text length, which could lead to substantial computational waste, especially for short utterances.

Limited comparisons: The paper does not compare wi

Code & Models

Models

taeyoun811/whisfusion


Taxonomy

Topics: VLSI and Analog Circuit Testing · Blind Source Separation Techniques