Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

Haogeng Liu; Tao Wang; Jie Cao; Ran He; Jianhua Tao

arXiv:2306.05708 · cs.SD · June 13, 2023 · 1 cites

Open Access 3 Reviews

TL;DR

This paper introduces LinDiff, a linear diffusion model for speech synthesis that achieves high-quality results with significantly fewer inference steps, combining efficiency and quality in generative speech modeling.

Contribution

LinDiff employs a linear diffusion process and patch-based Transformer modeling to enable fast, high-quality speech synthesis with minimal diffusion steps.

Findings

01

High-quality speech synthesis with only one diffusion step

02

Faster inference speed compared to traditional diffusion models

03

Comparable quality to autoregressive models

Abstract

Denoising Diffusion Probabilistic Models have shown extraordinary ability on various generative tasks. However, their slow inference speed renders them impractical in speech synthesis. This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. Firstly, we employ linear interpolation between the target and noise to design a diffusion sequence for training, while previously the diffusion path that links the noise and target is a curved segment. When decreasing the number of sampling steps (i.e., the number of line segments used to fit the path), the ease of fitting straight lines compared to curves allows us to generate higher quality samples from a random noise with fewer iterations. Secondly, to reduce computational complexity and achieve effective global modeling of noisy speech,…
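The straight-line path described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the time convention (t = 0 at the target, t = 1 at the noise), the function names `interpolate`, `velocity`, and `euler_sample`, and the use of an oracle velocity in place of a trained network are all assumptions made for illustration. It shows why a linear path is easy to follow with few steps: the derivative along the path is constant, so a fixed-step Euler solver does not accumulate curvature error.

```python
# Minimal sketch of a linear diffusion path and few-step ODE sampling.
# Assumptions (not from the paper): t runs from 0 (target) to 1 (noise),
# and an oracle velocity stands in for the trained network.

def interpolate(target, noise, t):
    """Point on the straight line between target (t=0) and noise (t=1)."""
    return [(1.0 - t) * x + t * e for x, e in zip(target, noise)]


def velocity(target, noise):
    """Derivative dx_t/dt of the linear path; it does not depend on t."""
    return [e - x for x, e in zip(target, noise)]


def euler_sample(noise, velocity_fn, num_steps):
    """Integrate the ODE from t=1 (noise) back to t=0 with fixed-step Euler.

    velocity_fn(x, t) is a hypothetical stand-in for a learned predictor.
    """
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        # Step backwards in time along the predicted direction.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

With an exact velocity, `euler_sample(noise, lambda x, t: velocity(target, noise), 1)` returns the target (up to floating-point rounding) in a single step. A trained model only approximates this velocity, so in practice the step count trades speed against quality.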

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

This model uses a linear diffusion process with a flow matching training method to model speech synthesis. Experiments show that it can generate higher-quality results with fewer denoising steps. The proposed model can synthesize speech with quality comparable to the autoregressive models with faster speed.

Weaknesses

1. The main weakness of this paper is the lack of innovation. The key point of the paper is using a linear diffusion process with flow matching; however, this has been proposed in previous work and shown to significantly reduce the number of inference steps.
2. The authors did not demonstrate how combining Transformer and CNN architectures affects the results. For example, using Transformers as the backbone of diffusion is not strictly necessary, and authors should compare it w

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

As far as I checked, the proposed LinDiff is technically sound. The proposed network architecture is novel. The experimental results suggest that LinDiff is capable of generating high-quality speech even with one sampling step. On the demo page, the quality of LinDiff sounds subjectively better than that of FastDiff.

Weaknesses

**Presentation**: It is quite hard to proceed from Section 2 (background) to Section 3 (method). I believe there are some irrelevant formulas (e.g. Eq. (4)) in Section 2 that do not contribute to the design of LinDiff. These formulas might sidetrack readers and, to a large extent, hinder their understanding. A quick fix would be to cite the contents from another paper and only keep the most influential ones (e.g. Eq. (8)). Besides, I cannot find the training loss for stage 1 in Algorithm 1,

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The paper proposes an ordinary differential equation formulation for waveform generation, which can help the model generate relatively high-fidelity speech with limited steps.
2. The paper is the first to introduce a Transformer-based noise predictor for waveform generation.
3. Experiments and an ablation study show that LinDiff is better than the previous baselines.

Weaknesses

The paper is well-written and clear. I acknowledge the contributions of the paper on the ODE formulation and the Transformer-based noise predictor. However, if these are the main contributions, I think more experiments should be conducted to verify the effectiveness of the proposed method. 1. As for the ODE formulation, apart from the proposed formulation, there exist many other formulations (e.g., the ODE in Grad-TTS/NaturalSpeech 2 and the original DDPM), which can also predict the ground-truth waveform. I t

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Dropout · Label Smoothing · Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings