TL;DR
Sinsy is a DNN-based singing voice synthesis system that models pitch, vibrato, and timing more accurately and pairs these models with a neural vocoder, earning higher mean opinion scores in subjective evaluations.
Contribution
The paper introduces a DNN-based singing voice synthesis (SVS) system with improved pitch, vibrato, and timing modeling. It incorporates the PeriodNet neural vocoder and automatic pitch correction to improve synthesis quality.
Findings
More natural vibrato and timing in synthesized singing voices.
Higher mean opinion scores in subjective evaluations.
Effective pitch correction even with out-of-tune training data.
Abstract
This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given musical score. Additionally, singing expressions that are not described on the musical score, such as vibrato and timing fluctuations, should be reproduced. The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder, and singing voices can be synthesized taking these characteristics of singing voices into account. To better model a singing voice, the proposed system incorporates improved approaches to modeling pitch and vibrato and better…
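The four-module flow named in the abstract (time-lag model, duration model, acoustic model, vocoder) can be pictured with a small sketch. This is not Sinsy's code or API: every function below is a hypothetical placeholder with a hand-written rule standing in for a trained DNN, the `Note` class and the 50 ms lag are assumptions made for illustration, and the sine-wave "vocoder" is only a stand-in for a neural vocoder such as PeriodNet. The sketch shows only how data might pass from a score to a waveform.

```python
import math
from dataclasses import dataclass
from typing import List

VOWELS = {"a", "i", "u", "e", "o"}  # assumption: toy phoneme set


@dataclass
class Note:
    """Hypothetical score note: pitch, nominal length, and lyric phonemes."""
    midi_pitch: int
    length_sec: float
    phonemes: List[str]


def midi_to_hz(midi_pitch: int) -> float:
    # Standard equal-temperament conversion (A4 = MIDI 69 = 440 Hz).
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def time_lag_model(note: Note) -> float:
    # Placeholder rule: start 50 ms early if the note opens with a consonant.
    # A trained model would predict this offset from score context.
    return 0.0 if note.phonemes[0] in VOWELS else -0.05


def duration_model(note: Note, lag_sec: float) -> List[float]:
    # Placeholder: split the (lag-adjusted) note span equally among phonemes.
    # Overlap with the previous note is ignored in this sketch.
    total = note.length_sec - lag_sec
    return [total / len(note.phonemes)] * len(note.phonemes)


def acoustic_model(note: Note, durations: List[float], frame_sec: float) -> List[float]:
    # Placeholder: emit a flat F0 contour at the score pitch, one value per frame.
    # A real acoustic model would also predict spectral features and vibrato.
    n_frames = round(sum(durations) / frame_sec)
    return [midi_to_hz(note.midi_pitch)] * n_frames


def vocoder(f0_frames: List[float], frame_sec: float, sample_rate: int) -> List[float]:
    # Stand-in for a neural vocoder: a phase-continuous sine wave following F0.
    samples_per_frame = round(frame_sec * sample_rate)
    phase, wave = 0.0, []
    for f0 in f0_frames:
        for _ in range(samples_per_frame):
            phase += 2.0 * math.pi * f0 / sample_rate
            wave.append(math.sin(phase))
    return wave


def synthesize(notes: List[Note], frame_sec: float = 0.005, sample_rate: int = 16000) -> List[float]:
    f0_frames: List[float] = []
    for note in notes:
        lag = time_lag_model(note)
        durations = duration_model(note, lag)
        f0_frames += acoustic_model(note, durations, frame_sec)
    return vocoder(f0_frames, frame_sec, sample_rate)


if __name__ == "__main__":
    score = [Note(69, 0.5, ["r", "a"]), Note(71, 0.5, ["a"])]
    print(len(synthesize(score)), "samples")
```

Keeping the modules as separate functions mirrors the staged structure the abstract describes: timing is decided first, then frame-level acoustic features, then the waveform. In an actual system each placeholder would be replaced by a trained network.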
