TransFusion: Transcribing Speech with Multinomial Diffusion
Matthew Baas, Kevin Eloff, Herman Kamper

TL;DR
TransFusion is a diffusion-based approach to speech recognition: it iteratively denoises a random character sequence into a transcript, conditioned on pretrained speech features, and reports performance on LibriSpeech comparable to existing high-performing contrastive models.
Contribution
This work is the first to apply denoising diffusion models to speech recognition, proposing new sampling and decoding techniques for multinomial diffusion in this domain.
Findings
Achieved performance comparable to high-performing contrastive models on LibriSpeech.
Developed effective sampling and decoding methods for multinomial diffusion models.
First application of diffusion models to speech recognition.
Abstract
Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or written sentence). In this work we aim to see whether the benefits of diffusion models can also be realized for speech recognition. To this end, we propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features. Specifically, we propose TransFusion: a transcribing diffusion model which iteratively denoises a random character sequence into coherent text corresponding to the transcript of a conditioning utterance. We demonstrate comparable performance to existing high-performing contrastive models on the…
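To make the idea of a "random character sequence" concrete, the following is a minimal sketch of the forward (noising) half of a standard multinomial diffusion process over characters, where q(x_t | x_0) = Cat(ᾱ_t x_0 + (1 − ᾱ_t)/K). It is not taken from the paper: the vocabulary, the linear beta schedule, the number of steps, and the function names are all assumptions made for illustration. The reverse (denoising) direction, which in TransFusion is a learned model conditioned on speech features, is not shown.

```python
import numpy as np

# Hypothetical vocabulary for illustration: 26 letters, space, apostrophe.
VOCAB = list("abcdefghijklmnopqrstuvwxyz '")
K = len(VOCAB)


def one_hot(text):
    """Encode a string as a one-hot array of shape (len(text), K)."""
    idx = np.array([VOCAB.index(c) for c in text])
    return np.eye(K)[idx]


def cumulative_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_t for t = 1..num_steps under a linear beta schedule.

    The schedule and its endpoints are assumptions, not the paper's settings.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_probs(x0, alpha_bar_t):
    """Class probabilities of q(x_t | x_0): keep the true character with
    weight alpha_bar_t, otherwise mix in a uniform distribution over K classes."""
    return alpha_bar_t * x0 + (1.0 - alpha_bar_t) / K


def q_sample(text, alpha_bar_t, rng):
    """Draw a noised character sequence x_t given the clean text x_0."""
    probs = q_probs(one_hot(text), alpha_bar_t)
    idx = [rng.choice(K, p=p) for p in probs]
    return "".join(VOCAB[i] for i in idx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = cumulative_alphas(num_steps=1000)
    for t in (0, 249, 999):
        # Later steps have smaller alpha_bar, so more characters are resampled.
        print(t + 1, repr(q_sample("hello world", alpha_bar[t], rng)))
```

At the final step ᾱ_t is close to zero under this schedule, so each character is drawn almost uniformly from the vocabulary. That is the kind of random sequence a denoising model would start from at inference time.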
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Diffusion
