HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Jiawei Chen; Xu Tan; Jian Luan; Tao Qin; Tie-Yan Liu

arXiv:2009.01776 · eess.AS · September 4, 2020 · 43 citations

TL;DR

HiFiSinger is a novel high-fidelity neural singing voice synthesis system that effectively handles high sampling rates by introducing multi-scale adversarial training and specialized GANs for mel-spectrogram and waveform modeling.

Contribution

The paper presents HiFiSinger, integrating multi-scale adversarial training with sub-frequency and multi-length GANs to improve high-fidelity singing voice synthesis at high sampling rates.
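The two ideas named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it only shows how a mel-spectrogram might be divided into frequency bands (one per sub-frequency discriminator) and how a waveform might be cropped to several lengths (one per multi-length discriminator). The function names, the number of bands, the band boundaries, and the crop lengths are all assumptions chosen for illustration.

```python
import numpy as np

def split_sub_frequency_bands(mel, num_bands=3):
    """Split a (frames, mel_bins) spectrogram into contiguous frequency bands.

    Hypothetical helper: the band count and boundaries are illustrative only.
    """
    return np.array_split(mel, num_bands, axis=1)

def random_crops(waveform, lengths, rng):
    """Take one random crop of each requested length from a 1-D waveform.

    Hypothetical helper: the crop lengths are illustrative only.
    """
    crops = []
    for length in lengths:
        start = rng.integers(0, len(waveform) - length + 1)
        crops.append(waveform[start:start + length])
    return crops

rng = np.random.default_rng(0)
mel = rng.standard_normal((200, 80))    # assumed shape: 200 frames, 80 mel bins
waveform = rng.standard_normal(48000)   # one second of audio at 48 kHz

# Each band would be scored by its own discriminator in the acoustic model.
bands = split_sub_frequency_bands(mel)
# Each crop length would be scored by its own discriminator in the vocoder.
crops = random_crops(waveform, [4800, 9600, 19200], rng)

print([b.shape for b in bands])
print([len(c) for c in crops])
```

In a real system each band and each crop would feed a separate discriminator network, so that no single discriminator has to cover the full frequency range or the full waveform length.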

Findings

1. Achieves MOS gains of 0.32/0.44 over 48kHz/24kHz baselines.

2. Outperforms previous SVS systems with 0.83 MOS improvement.

3. Effectively models wide frequency bands and long waveforms.

Abstract

High-fidelity singing voices usually require higher sampling rate (e.g., 48kHz) to convey expression and emotion. However, higher sampling rate causes the wider frequency band and longer waveform sequences and throws challenges for singing voice synthesis (SVS) in both frequency and time domains. Conventional SVS systems that adopt small sampling rate cannot well address the above challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice. HiFiSinger consists of a FastSpeech based acoustic model and a Parallel WaveGAN based vocoder to ensure fast training and inference and also high voice quality. To tackle the difficulty of singing modeling caused by high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling. Specifically, 1) To…
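The abstract's claim that a higher sampling rate widens the frequency band and lengthens the waveform follows from simple arithmetic, sketched below. The helper names and the 10-second clip duration are assumptions for illustration; the only facts used are that the representable band extends to half the sampling rate (the Nyquist frequency) and that the sample count is the rate times the duration.

```python
def nyquist_hz(sample_rate_hz):
    """Highest representable frequency at a given sampling rate."""
    return sample_rate_hz / 2

def num_samples(sample_rate_hz, seconds):
    """Number of waveform samples in a clip of the given duration."""
    return int(sample_rate_hz * seconds)

# Compare the two sampling rates mentioned on this page for a 10-second clip.
for sr in (24000, 48000):
    print(sr, nyquist_hz(sr), num_samples(sr, 10))
```

Doubling the rate from 24kHz to 48kHz doubles both the frequency band the acoustic model must cover and the number of samples the vocoder must generate for the same clip.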

Code & Models

Repositories

CODEJIN/HiFiSinger
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Convolution · Tanh Activation · WGAN-GP Loss · Dense Connections · Dropout · Phase Shuffle · WaveGAN