Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion
Haogeng Liu, Tao Wang, Jie Cao, Ran He, Jianhua Tao

TL;DR
This paper introduces LinDiff, a linear diffusion model for speech synthesis that achieves high-quality results with significantly fewer inference steps, combining efficiency and quality in generative speech modeling.
Contribution
LinDiff employs a linear diffusion process and patch-based Transformer modeling to enable fast, high-quality speech synthesis with minimal diffusion steps.
Findings
High-quality speech synthesis with only one diffusion step
Faster inference speed compared to traditional diffusion models
Comparable quality to autoregressive models
Abstract
Denoising Diffusion Probabilistic Models have shown extraordinary ability on various generative tasks. However, their slow inference speed renders them impractical in speech synthesis. This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. Firstly, we employ linear interpolation between the target and noise to design a diffusion sequence for training, while previously the diffusion path that links the noise and target is a curved segment. When decreasing the number of sampling steps (i.e., the number of line segments used to fit the path), the ease of fitting straight lines compared to curves allows us to generate higher quality samples from a random noise with fewer iterations. Secondly, to reduce computational complexity and achieve effective global modeling of noisy speech,…
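The straight-path idea in the abstract can be illustrated with a toy NumPy sketch. It is not the paper's implementation. The abstract is truncated and gives no parameterization, so the time convention (t = 0 is the target, t = 1 is noise), the names `linear_path` and `euler_sample`, and the oracle velocity field are all assumptions made for illustration. A trained network would replace the oracle and would only approximate it.

```python
import numpy as np

def linear_path(x0, noise, t):
    """Point on the straight line from the target x0 (t=0) to the noise (t=1).
    Assumed parameterization; the source does not state one."""
    return (1.0 - t) * x0 + t * noise

def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) back to t=0 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))  # toy stand-in for a waveform
noise = rng.standard_normal(16)

# Oracle velocity (hypothetical): along the straight path, dx/dt = noise - x0,
# which is constant in t. A learned model would estimate this from (x, t).
def oracle_velocity(x, t):
    return noise - x0

one_step = euler_sample(oracle_velocity, noise, num_steps=1)
four_step = euler_sample(oracle_velocity, noise, num_steps=4)

print(np.allclose(one_step, x0), np.allclose(four_step, x0))
```

Because the velocity is constant along a straight path, a single Euler step with the exact velocity lands on the target, whereas a curved path would accumulate discretization error at low step counts. With a learned velocity the result depends on how well the network fits the field.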
Peer Reviews
Decision · Submitted to ICLR 2024
This model uses a linear diffusion process with a flow matching training method for speech synthesis. Experiments show that it can generate higher-quality results with fewer denoising steps. The proposed model can synthesize speech with quality comparable to that of autoregressive models at a faster speed.
1. The main weakness of this paper is the lack of innovation. The key point of the paper is using a linear diffusion process with flow matching; however, this has been proposed in previous work and shown to significantly reduce the number of inference steps. 2. The authors did not demonstrate the impact of combining Transformer and CNN architectures on the results. For example, using Transformers as the backbone of diffusion is not strictly necessary, and the authors should compare it w
As far as I checked, the proposed LinDiff is technically sound. The proposed network architecture is novel. The experimental results suggest that LinDiff is capable of generating high-quality speech even with one sampling step. On the demo page, the quality of LinDiff is subjectively better than that of FastDiff.
**Presentation**: It is quite hard to proceed from Section 2 (background) to Section 3 (method). I believe there are some irrelevant formulas (e.g. Eq. (4)) in Section 2 that do not contribute to the design of LinDiff. These formulas might sidetrack readers and, to a large extent, hinder their understanding. A quick fix would be to cite the contents from another paper and only keep the most influential ones (e.g. Eq. (8)). Besides, I cannot find the training loss for stage 1 in Algorithm 1,
1. The paper proposes an ordinary differential equation formulation for waveform generation, which can help the model generate relatively high-fidelity speech with a limited number of steps. 2. The paper is the first to introduce a Transformer-based noise predictor for waveform generation. 3. Experiments and an ablation study show that LinDiff is better than the previous baselines.
The paper is well-written and clear. I acknowledge the contributions of the paper on the ODE formulation and the Transformer-based noise predictor. However, if these are the main contributions, I think more experiments should be conducted to verify the effectiveness of the proposed method. 1. As for the ODE formulation, apart from the proposed formulation, there exist many other formulations (e.g., the ODE in Grad-TTS/NaturalSpeech 2 and the original DDPM), which can also predict the ground-truth waveform. I t
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Dropout · Label Smoothing · Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings
