Modeling Singing F0 With Neural Network Driven Transition-Sustain Models
Kanru Hua

TL;DR
This paper introduces a neural network approach for generating singing-voice F0 curves from musical scores. A transition model and a sustain model are combined into one continuous F0 contour that captures vibratos and note-boundary details.
Contribution
It proposes a neural network framework that models note transitions and sustained-note vibratos with separate networks, targeting the vibrato and note-boundary detail that statistical parametric methods tend to oversmooth.
Findings
Subjective tests show high similarity to original singing performances.
Models effectively reproduce vibratos and note boundary details.
Approach outperforms traditional statistical parametric methods.
Abstract
This study focuses on generating fundamental frequency (F0) curves of singing voice from musical scores stored in a MIDI-like notation. Current statistical parametric approaches to singing F0 modeling have difficulty reproducing vibratos and the temporal details at note boundaries due to the oversmoothing tendency of statistical models. This paper presents a neural-network-based solution that models a pair of neighboring notes at a time (the transition model) and uses a separate network for generating vibratos (the sustain model). Predictions from the two models are combined by summation after proper enveloping to enforce continuity. In the training phase, mild misalignment between the scores and the target F0 is addressed by back-propagating the gradients to the networks' inputs. Subjective listening tests on the NITech singing database show that transition-sustain models are able…
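The combination step described in the abstract (per-pair transition predictions plus per-note vibrato, summed after enveloping) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `transition_stub` and `sustain_stub` are hand-written placeholders standing in for the two trained networks, and the frame rate, the linear cross-fade ramps, and the Hann-shaped vibrato envelope are illustrative choices, not necessarily the envelopes used in the paper. Notes are assumed to be contiguous, with no rests.

```python
import numpy as np

FRAME_RATE = 200  # frames per second (assumed value)

def transition_stub(p_prev, p_next, n_prev, n_next, tau=0.03):
    """Placeholder for the transition network: a sigmoid glide (in semitones)
    centred on the boundary between two neighbouring notes."""
    t = (np.arange(n_prev + n_next) - n_prev) / FRAME_RATE
    glide = 0.5 * (1.0 + np.tanh(t / (2.0 * tau)))  # overflow-safe sigmoid
    return p_prev + (p_next - p_prev) * glide

def sustain_stub(n_frames, rate_hz=5.5, depth=0.3):
    """Placeholder for the sustain network: a fixed sinusoidal vibrato (semitones)."""
    t = np.arange(n_frames) / FRAME_RATE
    return depth * np.sin(2.0 * np.pi * rate_hz * t)

def generate_f0(notes):
    """notes: list of (midi_pitch, duration_seconds); needs at least two notes.
    Returns the F0 curve in Hz and the summed transition envelopes."""
    lengths = [int(round(dur * FRAME_RATE)) for _, dur in notes]
    starts = np.concatenate([[0], np.cumsum(lengths)]).astype(int)
    pitch = np.zeros(starts[-1])
    weight_sum = np.zeros(starts[-1])
    n_pairs = len(notes) - 1

    # Transition part: one prediction per pair of neighbouring notes.
    # Each prediction covers both notes; linear ramps cross-fade overlapping
    # predictions so that the envelopes sum to one on every frame.
    for k in range(n_pairs):
        n_a, n_b = lengths[k], lengths[k + 1]
        pred = transition_stub(notes[k][0], notes[k + 1][0], n_a, n_b)
        up = np.ones(n_a) if k == 0 else np.linspace(0.0, 1.0, n_a, endpoint=False)
        down = (np.ones(n_b) if k == n_pairs - 1
                else 1.0 - np.linspace(0.0, 1.0, n_b, endpoint=False))
        env = np.concatenate([up, down])
        seg = slice(starts[k], starts[k + 2])
        pitch[seg] += env * pred
        weight_sum[seg] += env

    # Sustain part: vibrato per note, tapered to zero at the note boundaries.
    for k, n in enumerate(lengths):
        seg = slice(starts[k], starts[k + 1])
        pitch[seg] += np.hanning(n) * sustain_stub(n)

    f0_hz = 440.0 * 2.0 ** ((pitch - 69.0) / 12.0)  # MIDI semitones to Hz
    return f0_hz, weight_sum

f0, weights = generate_f0([(60, 0.5), (62, 0.5), (64, 0.5)])
```

Because the transition envelopes sum to one and the vibrato envelope vanishes at note boundaries, the summed curve stays continuous even though each prediction is generated independently.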
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
