Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration towards High-Quality Speech Generation from SSL features
Hien Ohnaka, Yuma Shirahata, Masaya Kawamura

TL;DR
WaveTrainerFit is a neural vocoder for high-quality speech synthesis from SSL features. It introduces trainable priors and fixed-point iteration, which reduce the number of inference steps needed while improving naturalness and speaker similarity.
Contribution
It introduces trainable priors and fixed-point iteration to improve waveform generation efficiency and quality from SSL features, building upon and extending WaveFit.
Findings
Achieves high naturalness and speaker similarity with fewer inference steps.
Demonstrates robustness across different SSL feature extraction depths.
Requires less computation than previous models.
Abstract
We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model and a generative adversarial network. Furthermore, the proposed method incorporates the following key improvements: 1. By introducing trainable priors, the inference process starts from noise close to the target speech instead of Gaussian noise. 2. Reference-aware gain adjustment is performed by imposing constraints on the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we showed that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from…
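The two ideas in the abstract, starting from a prior close to the target and rescaling each iterate to a reference energy, can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The names `gain_adjust`, `toy_denoiser`, and `fixed_point_generate` are hypothetical. The "denoiser" is a hand-written stand-in for a trained network, and the prior mean and scale are supplied by hand rather than learned.

```python
import numpy as np


def gain_adjust(x, target_energy, eps=1e-8):
    """Rescale x so that its mean power matches target_energy."""
    power = np.mean(x ** 2)
    return x * np.sqrt(target_energy / (power + eps))


def toy_denoiser(y, target):
    """Hypothetical stand-in for a trained network.

    It returns half of the residual between the current iterate and the
    target, so each update moves the iterate halfway towards the target.
    """
    return 0.5 * (y - target)


def fixed_point_generate(prior_mean, prior_scale, target, target_energy,
                         steps=3, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: the initial point is noise centred on a prior, not N(0, I).
    y = prior_mean + prior_scale * rng.standard_normal(prior_mean.shape)
    y = gain_adjust(y, target_energy)
    for _ in range(steps):
        # One fixed-point update: remove the predicted residual, then
        # rescale the result to the reference energy.
        y = gain_adjust(y - toy_denoiser(y, target), target_energy)
    return y


# Toy "speech": a 50 Hz sine sampled at 1600 points.
t = np.linspace(0.0, 1.0, 1600, endpoint=False)
target = np.sin(2 * np.pi * 50 * t)
target_energy = float(np.mean(target ** 2))

# The prior mean is a scaled copy of the target, chosen only for illustration.
out = fixed_point_generate(0.8 * target, 0.1, target, target_energy)
```

Because the starting point already lies near the target, a handful of updates suffices in this toy setting. In a real vocoder the denoiser would be conditioned on SSL features and would not have access to the target waveform.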
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
