N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement

Gyeong-Hoon Lee; Tae-Woo Kim; Hanbin Bae; Min-Ji Lee; Young-Ik Kim; Hoon-Young Cho

arXiv:2106.15205·eess.AS·February 22, 2022

Open Access

TL;DR

N-Singer is a non-autoregressive Korean singing voice synthesis system that improves pronunciation accuracy and naturalness by using Transformer and convolutional modules along with voicing-aware discriminators.

Contribution

It introduces a novel non-autoregressive architecture that models linguistic and pitch information separately for improved pronunciation in singing voice synthesis.
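One way to picture the separate modelling of linguistic and pitch information is as two frame-aligned input streams that a model could encode independently. The sketch below is an illustration under stated assumptions, not the paper's preprocessing: the note format `(syllable, midi_pitch, n_frames)` and the helper `expand_score` are hypothetical names introduced here.

```python
from typing import List, Tuple

# Hypothetical note format (an assumption, not from the paper):
# (syllable text, MIDI pitch number, duration in frames)
Note = Tuple[str, int, int]


def expand_score(notes: List[Note]) -> Tuple[List[str], List[int]]:
    """Expand note-level entries into two separate frame-aligned streams.

    The text stream carries only linguistic content and the pitch stream
    carries only pitch, so each can be fed to its own encoder.
    """
    text_stream: List[str] = []
    pitch_stream: List[int] = []
    for syllable, midi_pitch, n_frames in notes:
        if n_frames <= 0:
            raise ValueError("each note needs a positive frame count")
        # Repeat each value once per frame so both streams stay aligned.
        text_stream.extend([syllable] * n_frames)
        pitch_stream.extend([midi_pitch] * n_frames)
    return text_stream, pitch_stream


score = [("na", 60, 2), ("bi", 64, 3)]
text_stream, pitch_stream = expand_score(score)
```

Both streams have the same length, so frame `i` of the text stream and frame `i` of the pitch stream describe the same instant while remaining separate inputs.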

Findings

01. Synthesizes natural singing voices in parallel.

02. Achieves more accurate pronunciation than baseline models.

03. Utilizes voicing-aware discriminators for harmonic and noise feature capture.

Abstract

Recent end-to-end Korean singing voice synthesis systems generate realistic singing voices, but they still lack robustness in pronunciation accuracy. In this paper, we propose N-Singer, a non-autoregressive Korean singing voice synthesis system that synthesizes accurately pronounced Korean singing voices in parallel. N-Singer consists of a Transformer-based mel-generator, a convolutional network-based postnet, and voicing-aware discriminators. Its contributions are as follows. First, for accurate pronunciation, N-Singer models linguistic and pitch information separately, without other acoustic features. Second, to produce improved mel-spectrograms, N-Singer combines Transformer-based modules with convolutional network-based modules. Third, in adversarial training, voicing-aware conditional discriminators are used to capture…
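As a rough illustration of what "voicing-aware" could mean in practice, the sketch below splits mel-spectrogram frames into voiced and unvoiced groups so that separate discriminators could each inspect one group. This is an assumption made for illustration: the text above only states that the discriminators are voicing-aware and conditional. The rule "a frame is voiced when F0 > 0" and the helper `split_by_voicing` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one mel-spectrogram frame (a vector of mel bins)


def split_by_voicing(
    mel: List[Frame], f0: Sequence[float]
) -> Tuple[List[Frame], List[Frame]]:
    """Partition frames into (voiced, unvoiced) using a per-frame F0 track.

    Assumption: F0 > 0 marks a voiced frame; F0 <= 0 marks an unvoiced one.
    """
    if len(mel) != len(f0):
        raise ValueError("mel and f0 must have the same number of frames")
    voiced = [frame for frame, pitch in zip(mel, f0) if pitch > 0.0]
    unvoiced = [frame for frame, pitch in zip(mel, f0) if pitch <= 0.0]
    return voiced, unvoiced


# Toy data: four frames with two mel bins each; frames 0 and 3 are voiced.
mel = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
f0 = [220.0, 0.0, 0.0, 246.9]
voiced, unvoiced = split_by_voicing(mel, f0)
```

Under this assumption, a discriminator fed the voiced group would mostly see harmonic structure, while one fed the unvoiced group would mostly see noise-like frames.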

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing