Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis

Taewoo Kim; Choongsang Cho; Young Han Lee

arXiv:2406.09894 · eess.AS · September 12, 2024

Open Access

TL;DR

Period Singer is an end-to-end singing voice synthesis model that uses variational autoencoders for the periodic and aperiodic components to improve naturalness and address the one-to-many problem. It also removes the need for an external aligner.

Contribution

Period Singer integrates variational autoencoders for the periodic and aperiodic components and estimates phoneme alignment internally through a monotonic alignment search within note boundaries, advancing end-to-end singing voice synthesis.

Findings

01 Outperforms existing end-to-end SVS models on Mandarin and Korean datasets

02 Eliminates the need for an external aligner

03 Demonstrates improved naturalness in synthesized singing voices

Abstract

In this paper, we present Period Singer, a novel end-to-end singing voice synthesis (SVS) model that utilizes variational inference for periodic and aperiodic components, aimed at producing natural-sounding waveforms. Recent end-to-end SVS models have demonstrated the capability of synthesizing high-fidelity singing voices. However, owing to deterministic pitch conditioning, they do not fully address the one-to-many problem. To address this problem, we present the Period Singer architecture, which integrates variational autoencoders for the periodic and aperiodic components. Additionally, our methodology eliminates the dependency on an external aligner by estimating the phoneme alignment through a monotonic alignment search within note boundaries. Our empirical evaluations show that Period Singer outperforms existing end-to-end SVS models on Mandarin and Korean datasets. The efficacy of…
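The abstract mentions estimating phoneme alignment through a monotonic alignment search within note boundaries. The sketch below illustrates one way this idea could work. It is not the authors' implementation. It uses the dynamic-programming form of monotonic alignment search commonly used in TTS, and it assumes that note boundaries are enforced by masking each phoneme's scores outside the frame span of its note. All names (`monotonic_alignment_search`, `note_constrained_mas`, `phoneme_note`, `note_frames`, `MASK`) and the toy scores are hypothetical.

```python
import numpy as np

MASK = -1e9  # assumed penalty for frames outside a phoneme's note


def monotonic_alignment_search(logp):
    """Return, for every frame, the phoneme index on the best monotonic path.

    logp has shape (n_phonemes, n_frames); logp[i, j] scores phoneme i at
    frame j. The path starts at phoneme 0, ends at the last phoneme, and
    each step either stays on the same phoneme or moves to the next one.
    """
    n_ph, n_fr = logp.shape
    assert n_ph <= n_fr, "need at least one frame per phoneme"
    q = np.full((n_ph, n_fr), -np.inf)
    q[0, 0] = logp[0, 0]
    for j in range(1, n_fr):
        for i in range(min(j + 1, n_ph)):
            stay = q[i, j - 1]
            move = q[i - 1, j - 1] if i > 0 else -np.inf
            q[i, j] = logp[i, j] + max(stay, move)
    # Backtrack from the last phoneme at the last frame.
    path = np.zeros(n_fr, dtype=int)
    i = n_ph - 1
    for j in range(n_fr - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1, j - 1] >= q[i, j - 1]):
            i -= 1
    return path


def note_constrained_mas(logp, phoneme_note, note_frames):
    """Run the search after masking frames outside each phoneme's note.

    phoneme_note[i] is the note index of phoneme i (non-decreasing).
    note_frames[k] is the (start, end) frame span of note k, end exclusive.
    """
    masked = np.full_like(logp, MASK)
    for i, k in enumerate(phoneme_note):
        start, end = note_frames[k]
        masked[i, start:end] = logp[i, start:end]
    return monotonic_alignment_search(masked)


# Toy example: three phonemes, six frames, two notes.
toy_logp = np.array([
    [-1.0, -1.0, -5.0, -5.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -2.0, -5.0, -5.0],
    [-5.0, -5.0, -5.0, -1.0, -1.0, -1.0],
])
toy_phoneme_note = [0, 0, 1]        # phonemes 0 and 1 share note 0
toy_note_frames = [(0, 4), (4, 6)]  # note 0: frames 0-3, note 1: frames 4-5

free_path = monotonic_alignment_search(toy_logp)
note_path = note_constrained_mas(toy_logp, toy_phoneme_note, toy_note_frames)
print(free_path.tolist(), note_path.tolist())
```

In this toy case the unconstrained search assigns frame 3 to the last phoneme, while the note-constrained search keeps frame 3 with phoneme 1 because the last phoneme's note only starts at frame 4. Only the phoneme boundaries inside each note are left to the search, which is why a score-based constraint like this could stand in for an external aligner.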

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Variational Inference