TransFusion: Transcribing Speech with Multinomial Diffusion

Matthew Baas; Kevin Eloff; Herman Kamper

arXiv:2210.07677 · eess.AS · October 17, 2022

TL;DR

TransFusion is a diffusion-based approach to speech recognition: it iteratively denoises a random character sequence into a transcript, conditioned on pretrained speech features, and reports performance comparable to existing high-performing contrastive models on LibriSpeech.

Contribution

The authors present this work as the first to apply denoising diffusion models to speech recognition. It also proposes new sampling and decoding techniques for multinomial diffusion models in this setting.

Findings

01. Achieved performance comparable to high-performing contrastive models on LibriSpeech.
02. Developed effective sampling and decoding methods for multinomial diffusion models.
03. First application of diffusion models to speech recognition.

Abstract

Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or written sentence). In this work we aim to see whether the benefits of diffusion models can also be realized for speech recognition. To this end, we propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features. Specifically, we propose TransFusion: a transcribing diffusion model which iteratively denoises a random character sequence into coherent text corresponding to the transcript of a conditioning utterance. We demonstrate comparable performance to existing high-performing contrastive models on the…
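As a rough illustration of the "random character sequence" that the abstract describes as the starting point, the sketch below implements the generic forward (noising) process of multinomial diffusion on a text string. It is not taken from the paper or its repository: the 28-symbol vocabulary `VOCAB`, the linear schedule `alpha_bar`, the helper name `q_sample`, and the 1000-step horizon are all assumptions chosen for readability. The reverse (denoising) direction, which TransFusion learns and conditions on speech features, needs a trained model and is not shown.

```python
import random

# Hypothetical 28-symbol character vocabulary (an assumption, not from the paper).
VOCAB = list("abcdefghijklmnopqrstuvwxyz '")
K = len(VOCAB)


def alpha_bar(t, num_steps):
    """Assumed linear schedule: fraction of the original signal kept at step t."""
    return 1.0 - t / num_steps


def q_sample(text, t, num_steps, rng):
    """Sample x_t ~ q(x_t | x_0) for multinomial diffusion.

    Each character is drawn from
        Cat(alpha_bar * onehot(x_0) + (1 - alpha_bar) / K),
    which is equivalent to keeping the character with probability alpha_bar
    and otherwise resampling it uniformly from the vocabulary.
    """
    keep = alpha_bar(t, num_steps)
    return "".join(
        ch if rng.random() < keep else rng.choice(VOCAB)
        for ch in text
    )


rng = random.Random(0)
x0 = "the cat sat on the mat"
# Print progressively noisier versions of the transcript.
for t in (0, 250, 500, 750, 1000):
    print(t, repr(q_sample(x0, t, 1000, rng)))
```

At `t = 0` the transcript is returned unchanged, and at the final step every character is drawn uniformly from the vocabulary. A transcribing diffusion model would start from such a uniform sample and iteratively denoise it back toward text.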

Code & Models

Repositories

rf5/transfusion-asr (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Diffusion