A Systematic Exploration of Joint-training for Singing Voice Synthesis
Yuning Wu, Yifeng Yu, Jiatong Shi, Tao Qian, Qin Jin

TL;DR
This paper systematically investigates joint-training strategies for singing voice synthesis and, through extensive experiments, demonstrates improved stability and interpretability over separately trained acoustic models and vocoders.
Contribution
It introduces a novel joint-training approach for acoustic models and vocoders in SVS, addressing the gap caused by separate optimization.
Findings
Joint-training outperforms baseline models in stability.
Enhanced interpretability of the SVS framework.
Consistent performance across multiple datasets.
Abstract
There has been growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in text-to-speech (TTS) systems by joint-training or by replacing acoustic features with a latent representation, adopting corresponding approaches to SVS is not an easy task. How to improve the joint-training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform joint-training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy…
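To make the idea of joint optimization concrete, here is a minimal toy sketch. It is not the architecture or training recipe from the paper. It assumes a scalar "acoustic model" (weight `a`) and a scalar "vocoder" (weight `v`), and all names, data, and hyperparameters are hypothetical. It shows only the core mechanism: the vocoder consumes the predicted feature, so the waveform error also updates the acoustic model.

```python
# Toy illustration only: scalar stand-ins for the two models, with made-up data.

def joint_loss(a, v, x, mel, wav, lam=1.0):
    """Feature loss on the acoustic model plus waveform loss on the vocoder."""
    mel_hat = a * x          # acoustic model: score input -> acoustic feature
    wav_hat = v * mel_hat    # vocoder consumes the *predicted* feature
    return (mel_hat - mel) ** 2 + lam * (wav_hat - wav) ** 2


def joint_step(a, v, x, mel, wav, lam=1.0, lr=0.01):
    """One gradient-descent step on the joint loss for a single example."""
    mel_hat = a * x
    wav_hat = v * mel_hat
    d_mel = 2.0 * (mel_hat - mel)
    d_wav = 2.0 * lam * (wav_hat - wav)
    # The second term is the waveform error flowing back through the vocoder
    # into the acoustic model; separate training would not have this term.
    grad_a = d_mel * x + d_wav * v * x
    grad_v = d_wav * mel_hat
    return a - lr * grad_a, v - lr * grad_v


def train(data, a=0.1, v=0.1, epochs=2000):
    """Repeatedly apply joint_step over (x, mel, wav) triples."""
    for _ in range(epochs):
        for x, mel, wav in data:
            a, v = joint_step(a, v, x, mel, wav)
    return a, v


# Hypothetical ground truth: mel = 2 * x and wav = 3 * mel.
data = [(x, 2.0 * x, 6.0 * x) for x in (0.5, 1.0, 1.5)]
a, v = train(data)
```

In a real SVS system both components would be neural networks and the waveform loss would typically include adversarial or spectral terms. The sketch keeps only the gradient path that distinguishes joint from separate optimization.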
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
