VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

Yongmao Zhang; Heyang Xue; Hanzhao Li; Lei Xie; Tingwei Guo; Ruixiong Zhang; Caixia Gong

arXiv:2211.02903·cs.SD·November 8, 2022·1 cites

TL;DR

VISinger 2 enhances end-to-end singing voice synthesis by integrating differentiable digital signal processing. It addresses the phase, glitch, and sampling-rate problems of VISinger to produce higher-fidelity 44.1 kHz audio with richer expression.

Contribution

The paper introduces VISinger 2, a singing voice synthesis (SVS) model that combines a DSP synthesizer with neural networks to improve phase handling, reduce glitches, and enable high-fidelity full-band audio generation.

Findings

01 Outperforms previous models in subjective quality

02 Achieves 44.1 kHz high-fidelity singing synthesis

03 Reduces phase and glitch artifacts effectively

Abstract

The end-to-end singing voice synthesis (SVS) model VISinger can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: (1) the text-to-phase problem: the end-to-end model learns a meaningless mapping from text to phase; (2) the glitch problem: the harmonic components corresponding to the periodic signal of voiced segments change suddenly, producing audible artefacts; and (3) a low sampling rate: the 24 kHz sampling rate does not meet the needs of high-fidelity, full-band generation (44.1 kHz or higher). In this paper, we propose VISinger 2 to address these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder to solve the above issues. The DSP…
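
For readers unfamiliar with DDSP-style synthesis, the following is a minimal sketch of a generic additive harmonic oscillator, the kind of component that produces a periodic signal for voiced segments from a pitch contour. It is not the authors' implementation: the abstract is truncated before it describes the DSP synthesizer, so the function name `harmonic_synth`, its parameters, the hop size of 512, and the use of NumPy instead of a differentiable framework are all assumptions made for illustration. The point it shows is that phase is obtained by integrating F0 rather than predicted from text, which keeps harmonics continuous across frames.

```python
import numpy as np


def harmonic_synth(f0, amplitudes, sample_rate=44100, hop=512):
    """Hypothetical additive harmonic oscillator (illustrative only).

    f0:         (frames,) fundamental frequency in Hz, 0 for unvoiced frames
    amplitudes: (frames, n_harmonics) per-harmonic linear amplitudes
    Returns a mono waveform of length frames * hop.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    amplitudes = np.asarray(amplitudes, dtype=np.float64)
    n_frames, n_harm = amplitudes.shape
    n_samples = n_frames * hop

    # Upsample frame-rate controls to sample rate by linear interpolation.
    # (Simplification: F0 glides linearly between voiced and unvoiced frames.)
    frame_pos = np.arange(n_frames) * hop
    sample_pos = np.arange(n_samples)
    f0_up = np.interp(sample_pos, frame_pos, f0)
    amp_up = np.stack(
        [np.interp(sample_pos, frame_pos, amplitudes[:, k]) for k in range(n_harm)],
        axis=1,
    )

    # Phase is the running integral of instantaneous frequency, so it stays
    # continuous even when F0 changes from frame to frame.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sample_rate)

    harmonics = np.arange(1, n_harm + 1)
    freqs = f0_up[:, None] * harmonics[None, :]

    # Zero out harmonics at or above Nyquist (to avoid aliasing) and any
    # sample where F0 is zero (unvoiced).
    amp_up = np.where(freqs < sample_rate / 2, amp_up, 0.0)
    amp_up = amp_up * (f0_up[:, None] > 0)

    return np.sum(amp_up * np.sin(phase[:, None] * harmonics[None, :]), axis=1)


if __name__ == "__main__":
    # Ten frames of a steady 220 Hz tone with four decaying harmonics.
    f0 = np.full(10, 220.0)
    amps = np.tile(np.array([1.0, 0.5, 0.25, 0.125]), (10, 1))
    wave = harmonic_synth(f0, amps)
    print(wave.shape)
```

In a DDSP-style model, a neural network would predict the per-frame amplitudes (and typically a filtered-noise component for the aperiodic part), and the oscillator would be written in a differentiable framework so that gradients flow through it. How VISinger 2 wires such a component into its decoder is described in the paper itself.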

Code & Models

Repositories

zhangyongmao/VISinger2
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing