A Preliminary Investigation on Flexible Singing Voice Synthesis Through Decomposed Framework with Inferrable Features

Lester Phillip Violeta; Taketo Akama

arXiv:2407.09346·cs.SD·July 15, 2024

TL;DR

This paper explores a decomposed, inferrable feature-based framework for singing voice synthesis, aiming to enhance flexibility, reduce dataset dependency, and enable multi-language and singer adaptation.

Contribution

It introduces a novel three-stage decomposed framework for SVS that infers features directly from audio, improving flexibility and reducing the need for extensive labeled datasets.

Findings

01. Framework can infer linguistic, pitch, and voice features from audio.

02. Enables adaptation to different languages and singers.

03. Potential to achieve state-of-the-art performance with added flexibility.

Abstract

We investigate the feasibility of a singing voice synthesis (SVS) system by using a decomposed framework to improve flexibility in generating singing voices. Due to data-driven approaches, SVS performs a music score-to-waveform mapping; however, the direct mapping limits control, such as being able to only synthesize in the language or the singers present in the labeled singing datasets. As collecting large singing datasets labeled with music scores is an expensive task, we investigate an alternative approach by decomposing the SVS system and inferring different singing voice features. We decompose the SVS system into three-stage modules of linguistic, pitch contour, and synthesis, in which singing voice features such as linguistic content, F0, voiced/unvoiced, singer embeddings, and loudness are directly inferred from audio. Through this decomposed framework, we show that we can…
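The three-stage decomposition named in the abstract (linguistic, pitch contour, synthesis) can be pictured as a data flow between modules that exchange per-frame features. The sketch below is only an illustration of that flow, not the authors' implementation: every name (`Note`, `FrameFeatures`, `linguistic_stage`, `pitch_stage`, `synthesis_stage`), the frame and sample rates, and the toy logic inside each stage are assumptions made here. In particular, a sine oscillator stands in for the synthesis module, and the features are derived from a score rather than inferred from audio as the paper describes.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

FRAME_RATE = 100     # frames per second (assumed value)
SAMPLE_RATE = 16000  # audio samples per second (assumed value)


@dataclass
class Note:
    """Hypothetical score entry: one lyric unit sung on one pitch."""
    phoneme: str     # lyric unit; "sil" marks a rest
    midi_pitch: int  # score pitch as a MIDI note number
    n_frames: int    # duration in frames


@dataclass
class FrameFeatures:
    """The feature set listed in the abstract, in a hypothetical layout."""
    linguistic: List[str]  # per-frame linguistic content
    f0_hz: List[float]     # per-frame fundamental frequency
    voiced: List[bool]     # per-frame voiced/unvoiced flag
    loudness: List[float]  # per-frame loudness, here in [0, 1]
    singer: List[float]    # utterance-level singer embedding


def linguistic_stage(score: List[Note]) -> List[str]:
    """Stage 1 stand-in: expand the score into per-frame linguistic content."""
    return [n.phoneme for n in score for _ in range(n.n_frames)]


def pitch_stage(score: List[Note]) -> Tuple[List[float], List[bool]]:
    """Stage 2 stand-in: flat F0 per note; rests are marked unvoiced."""
    f0: List[float] = []
    voiced: List[bool] = []
    for n in score:
        hz = 440.0 * 2 ** ((n.midi_pitch - 69) / 12)  # MIDI number to Hz
        is_voiced = n.phoneme != "sil"
        f0 += [hz if is_voiced else 0.0] * n.n_frames
        voiced += [is_voiced] * n.n_frames
    return f0, voiced


def synthesis_stage(feats: FrameFeatures) -> List[float]:
    """Stage 3 stand-in: sine oscillator driven by F0, V/UV and loudness.

    The singer embedding is carried in the features but ignored by this toy.
    """
    hop = SAMPLE_RATE // FRAME_RATE  # samples per frame
    wave: List[float] = []
    phase = 0.0
    for f0, is_voiced, loud in zip(feats.f0_hz, feats.voiced, feats.loudness):
        for _ in range(hop):
            phase += 2 * math.pi * f0 / SAMPLE_RATE
            wave.append(loud * math.sin(phase) if is_voiced else 0.0)
    return wave


# Usage: one sung note (A4) followed by a short rest.
score = [Note("a", 69, 10), Note("sil", 60, 5)]
ling = linguistic_stage(score)
f0, voiced = pitch_stage(score)
feats = FrameFeatures(ling, f0, voiced, [0.5] * len(ling), [0.0] * 4)
audio = synthesis_stage(feats)
```

Because the stages communicate only through the fields of `FrameFeatures`, any one of them could in principle be swapped (for example, a different pitch module) without touching the others, which is the kind of flexibility the abstract attributes to the decomposed design.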

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing