HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, Tie-Yan Liu

TL;DR
HiFiSinger is a neural singing voice synthesis (SVS) system aimed at high-fidelity audio at high sampling rates (e.g., 48kHz). It applies multi-scale adversarial training, with specialized GANs for mel-spectrogram and waveform modeling.
Contribution
The paper presents HiFiSinger, integrating multi-scale adversarial training with sub-frequency and multi-length GANs to improve high-fidelity singing voice synthesis at high sampling rates.
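One way to picture the multi-scale idea is as input preparation for several discriminators: the mel-spectrogram is cut into frequency sub-bands, and the waveform is cropped at several lengths. The sketch below shows only that slicing step, under assumptions. The function names, band edges, and crop lengths are hypothetical and not taken from the paper, and the discriminators themselves are omitted.

```python
import random


def split_sub_bands(mel, boundaries):
    """Slice each frame of a mel-spectrogram into frequency sub-bands.

    mel: list of frames, each a list of mel-bin values.
    boundaries: (low_bin, high_bin) pairs, with high_bin exclusive.
    """
    return [[frame[lo:hi] for frame in mel] for lo, hi in boundaries]


def random_crops(waveform, lengths, rng):
    """Take one random crop of each requested length from a waveform."""
    crops = []
    for n in lengths:
        start = rng.randrange(len(waveform) - n + 1)
        crops.append(waveform[start:start + n])
    return crops


rng = random.Random(0)

# Toy inputs: 10 frames x 80 mel bins, and one second of audio at 48kHz.
mel = [[rng.random() for _ in range(80)] for _ in range(10)]
waveform = [rng.uniform(-1.0, 1.0) for _ in range(48000)]

# Hypothetical band edges and crop lengths, chosen only for illustration.
BAND_EDGES = [(0, 40), (20, 60), (40, 80)]
CROP_LENGTHS = [1024, 4096, 16384]

bands = split_sub_bands(mel, BAND_EDGES)
crops = random_crops(waveform, CROP_LENGTHS, rng)

print([len(band[0]) for band in bands])  # bins per sub-band
print([len(crop) for crop in crops])     # samples per crop
```

In a full system, each sub-band and each crop length would presumably feed its own discriminator, so that no single network has to cover the whole frequency range or the whole waveform.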
Findings
Achieves mean opinion score (MOS) gains of 0.32 and 0.44 over the 48kHz and 24kHz baselines, respectively.
Outperforms previous SVS systems by 0.83 MOS.
Effectively models wide frequency bands and long waveforms.
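To make "wide frequency bands and long waveforms" concrete, the arithmetic below compares 24kHz and 48kHz audio. It relies only on the Nyquist limit (representable bandwidth is half the sampling rate) and on sample count being rate times duration. The helper names and the one-second clip are illustrative assumptions.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2


def num_samples(sample_rate_hz, seconds):
    """Number of waveform samples in a clip of the given duration."""
    return int(sample_rate_hz * seconds)


CLIP_SECONDS = 1.0  # illustrative clip length

for rate in (24000, 48000):
    print(rate, nyquist_hz(rate), num_samples(rate, CLIP_SECONDS))
```

Doubling the sampling rate doubles both the frequency range a model must cover and the number of samples it must generate for the same duration.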
Abstract
High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz) to convey expression and emotion. However, a higher sampling rate leads to a wider frequency band and longer waveform sequences, which poses challenges for singing voice synthesis (SVS) in both the frequency and time domains. Conventional SVS systems that adopt a low sampling rate cannot adequately address these challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice. HiFiSinger consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder to ensure fast training and inference as well as high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder to improve singing modeling. Specifically, 1) To…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Convolution · Tanh Activation · WGAN-GP Loss · Dense Connections · Dropout · Phase Shuffle · WaveGAN
