Vocoder-Projected Feature Discriminator
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

TL;DR
This paper introduces VPFD, a vocoder-projected feature discriminator that enhances voice conversion quality while significantly reducing training time and memory usage by leveraging vocoder features for adversarial training.
Contribution
The paper proposes a novel vocoder-projected feature discriminator that uses vocoder features for efficient adversarial training in voice conversion tasks.
Findings
Achieves VC performance comparable to that of waveform discriminators.
Reduces training time by a factor of 9.6.
Reduces memory consumption by a factor of 11.4.
Abstract
In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the…
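To make the overhead argument concrete, the sketch below compares the sequence length a discriminator would process when it taps a vocoder after a single upsampling step with the length of the full waveform. The upsampling schedule `UPSAMPLE_FACTORS`, the frame count, and both helper functions are hypothetical and chosen only for illustration; they are not taken from the paper. Sequence length is also only a rough proxy for cost, so the ratio printed here should not be read as the reported training-time or memory saving.

```python
# Hypothetical vocoder upsampling schedule (assumption, not from the paper).
# The product of the factors is the number of waveform samples per mel frame.
UPSAMPLE_FACTORS = [8, 8, 2, 2]  # 8 * 8 * 2 * 2 = 256


def lengths_after_each_step(num_frames, factors):
    """Return the temporal length of the feature map after each upsampling step."""
    lengths = []
    length = num_frames
    for factor in factors:
        length *= factor
        lengths.append(length)
    return lengths


def discriminator_input_length(num_frames, factors, num_steps):
    """Length seen by a discriminator that taps the vocoder after `num_steps`
    upsampling steps. num_steps == len(factors) corresponds to the waveform."""
    if not 0 <= num_steps <= len(factors):
        raise ValueError("num_steps must be between 0 and len(factors)")
    length = num_frames
    for factor in factors[:num_steps]:
        length *= factor
    return length


num_frames = 128  # hypothetical number of mel-spectrogram frames in a training segment

# Waveform discriminator: input has passed through every upsampling step.
waveform_len = discriminator_input_length(num_frames, UPSAMPLE_FACTORS, len(UPSAMPLE_FACTORS))

# Feature discriminator in the spirit of VPFD: input taken after one upsampling step.
feature_len = discriminator_input_length(num_frames, UPSAMPLE_FACTORS, 1)

print(waveform_len, feature_len, waveform_len // feature_len)
```

Under these assumed factors, the feature-level discriminator sees a sequence 32 times shorter than the waveform, which illustrates why stopping after one upsampling step can reduce time and memory overheads.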
