A Preliminary Investigation on Flexible Singing Voice Synthesis Through Decomposed Framework with Inferrable Features
Lester Phillip Violeta, Taketo Akama

TL;DR
This paper explores a decomposed framework for singing voice synthesis (SVS) built on features inferred from audio, aiming to improve flexibility, reduce dependence on labeled singing datasets, and enable adaptation to different languages and singers.
Contribution
It introduces a three-stage decomposed SVS framework (linguistic, pitch contour, and synthesis modules) that infers singing voice features directly from audio, improving flexibility and reducing the need for large singing datasets labeled with music scores.
Findings
The framework can infer linguistic content, pitch features (F0 and voiced/unvoiced), singer embeddings, and loudness directly from audio.
It enables adaptation to different languages and singers.
It shows potential to reach state-of-the-art performance while adding flexibility.
Abstract
We investigate the feasibility of a singing voice synthesis (SVS) system by using a decomposed framework to improve flexibility in generating singing voices. Due to data-driven approaches, SVS performs a music score-to-waveform mapping; however, the direct mapping limits control, such as being able to only synthesize in the language or the singers present in the labeled singing datasets. As collecting large singing datasets labeled with music scores is an expensive task, we investigate an alternative approach by decomposing the SVS system and inferring different singing voice features. We decompose the SVS system into three-stage modules of linguistic, pitch contour, and synthesis, in which singing voice features such as linguistic content, F0, voiced/unvoiced, singer embeddings, and loudness are directly inferred from audio. Through this decomposed framework, we show that we can…
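To make the idea of "features inferred directly from audio" concrete, the following is a minimal sketch of frame-level extraction for three of the features named in the abstract: F0, the voiced/unvoiced flag, and loudness. It is not the paper's method. It assumes a classical normalized-autocorrelation pitch estimate as a stand-in for whatever extractors the authors use, and the function name `frame_features`, its parameters, and the thresholds are all hypothetical choices made for illustration.

```python
import math


def frame_features(frame, sample_rate, fmin=80.0, fmax=500.0, voicing_threshold=0.5):
    """Return (f0_hz, voiced, loudness_db) for one frame of audio samples.

    Illustrative stand-in only: classical autocorrelation, not the paper's extractor.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    rms = math.sqrt(energy / n)
    # Loudness as RMS level in dB, floored to avoid log(0) on silence.
    loudness_db = 20.0 * math.log10(max(rms, 1e-10))
    if energy <= 1e-12:
        return 0.0, False, loudness_db

    # Search only the lags that correspond to the assumed pitch range.
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)

    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        head = frame[: n - lag]
        tail = frame[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        denom = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        corr = num / denom if denom > 0.0 else 0.0
        # The small margin keeps the shortest lag on near-ties, which
        # avoids picking a multiple of the true period (octave error).
        if corr > best_corr + 1e-6:
            best_corr, best_lag = corr, lag

    voiced = best_lag > 0 and best_corr >= voicing_threshold
    f0_hz = sample_rate / best_lag if voiced else 0.0
    return f0_hz, voiced, loudness_db


if __name__ == "__main__":
    sr = 16000
    # Synthetic 200 Hz tone, 50 ms long, used only to exercise the function.
    tone = [0.5 * math.sin(2.0 * math.pi * 200.0 * i / sr) for i in range(800)]
    print(frame_features(tone, sr))
```

In a decomposed system of the kind the abstract describes, per-frame values like these would be computed from unlabeled recordings and passed to a synthesis module together with linguistic content and a singer embedding, neither of which is sketched here.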
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
