A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

Zilu Guo; Qing Wang; Jun Du; Jia Pan; Qing-Feng Liu; Chin-Hui

arXiv:2405.16952·eess.AS·May 28, 2024·IEEE ACM Trans. Audio Speech Lang. Process.

TL;DR

This paper introduces a variance-preserving interpolation diffusion model (VPIDM) for single-channel speech enhancement and recognition. It needs only 25 iterative steps, is more robust across SNR levels, and outperforms conventional discriminative methods in noisy conditions.

Contribution

The paper presents a novel variance-preserving interpolation diffusion model (VPIDM) that removes the corrector required by the existing variance-exploding interpolation diffusion model (VEIDM) and improves performance in speech enhancement and recognition tasks.

Findings

01

VPIDM requires only 25 iterative steps and no corrector, unlike the existing VEIDM, which relies on one.

02

VPIDM outperforms conventional discriminative speech enhancement algorithms.

03

VPIDM shows increased robustness across different SNR levels.

Abstract

In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation diffusion model (VEIDM). Two notable distinctions between VPIDM and VEIDM are the scaling function of the mean of state variables and the constraint imposed on the variance relative to the mean's scale. We conduct a systematic exploration of the theoretical mechanism underlying VPIDM and develop insights regarding VPIDM's applications in SE and ASR using VPIDM as a frontend. Our proposed approach, evaluated on two distinct data sets, demonstrates VPIDM's superior performances over conventional…
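The two distinctions named in the abstract (how the mean of the state is scaled, and how the variance is tied to that scale) can be illustrated with a minimal NumPy sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names, the linear-beta schedule for `alpha`, the exponential interpolation weight `lam`, and all default constants are hypothetical and are not taken from the paper or from the zelokuo/VPIDM repository.

```python
import numpy as np


def vp_interpolation_state(x0, y, t, rng, beta_min=0.1, beta_max=2.0, lam_rate=1.5):
    """Sample a state x_t under a hypothetical variance-preserving interpolation.

    x0 is the clean signal, y the noisy signal, t a time in [0, 1].
    All schedules and constants here are illustrative assumptions.
    """
    # Mean scale alpha_t from a linear-beta schedule (assumed, not from the paper).
    alpha = np.exp(-0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2))
    # Interpolation weight that moves the mean from x0 toward y (assumed form).
    lam = 1.0 - np.exp(-lam_rate * t)
    mean = alpha * (lam * y + (1.0 - lam) * x0)
    # Variance-preserving constraint: alpha^2 + sigma^2 = 1.
    sigma = np.sqrt(1.0 - alpha**2)
    return mean + sigma * rng.standard_normal(x0.shape), alpha, sigma


def ve_interpolation_state(x0, y, t, rng, sigma_min=0.05, sigma_max=0.5, lam_rate=1.5):
    """Sample a state x_t under a hypothetical variance-exploding interpolation.

    The mean is an unscaled interpolation and sigma grows independently of it.
    """
    lam = 1.0 - np.exp(-lam_rate * t)
    mean = lam * y + (1.0 - lam) * x0
    # Geometric noise schedule with no coupling to the mean scale (assumed form).
    sigma = sigma_min * (sigma_max / sigma_min) ** t
    return mean + sigma * rng.standard_normal(x0.shape), 1.0, sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(8)             # stand-in for clean speech features
    y = x0 + 0.3 * rng.standard_normal(8)   # stand-in for noisy speech features
    for t in (0.0, 0.5, 1.0):
        _, a_vp, s_vp = vp_interpolation_state(x0, y, t, rng)
        _, a_ve, s_ve = ve_interpolation_state(x0, y, t, rng)
        print(f"t={t:.1f}  VP: alpha={a_vp:.3f} sigma={s_vp:.3f}  "
              f"VE: alpha={a_ve:.3f} sigma={s_ve:.3f}")
```

In this sketch the variance-preserving variant shrinks the mean as noise is added, so the squared mean scale and the noise variance always sum to one, while the variance-exploding variant leaves the mean unscaled and lets the noise level grow on its own schedule.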

Code & Models

Repositories

zelokuo/VPIDM
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research

Methods: Diffusion