SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model

Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, Lirong Dai

arXiv:2410.12536 · eess.AS · October 17, 2024

TL;DR

SiFiSinger is a high-fidelity end-to-end singing voice synthesis system built on a source-filter approach. It uses mel-cepstrum features and source excitation signals to improve pitch accuracy and synthesis quality, with results reported on the Opencpop dataset.

Contribution

The paper introduces an SVS system that combines source-filter modeling with mel-cepstrum features, which decouple spectral and F0 information, and with source excitation signals that represent F0, aiming at better pitch accuracy and synthesis quality.

Findings

1. Enhanced synthesis quality demonstrated on the Opencpop dataset

2. Improved pitch accuracy through source excitation signals

3. Effective decoupling of mel-spectrogram and F0 features

Abstract

This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism that directly translates lyrical and melodic cues into expressive, high-fidelity, human-like singing. Like VISinger 2, the proposed system uses training paradigms evolved from VITS and incorporates components such as a fundamental pitch (F0) predictor and a waveform generation decoder. Because coupling mel-spectrogram features with F0 information may introduce errors during F0 prediction, we consider two strategies. First, we leverage mel-cepstrum (mcep) features to decouple the intertwined mel-spectrogram and F0 characteristics. Second, inspired by neural source-filter models, we introduce source excitation signals as the representation of F0 in the SVS system, aiming to capture pitch nuances more accurately. Meanwhile,…
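To make the idea of a source excitation signal concrete, the following is a minimal sketch of how a frame-level F0 contour might be turned into a sample-level excitation, in the general style of sine-based sources used by neural source-filter models. It is not the paper's implementation: the function name `sine_excitation`, the sample rate, hop size, amplitude, and noise levels are all illustrative assumptions, and only the Python standard library is used.

```python
import math
import random


def sine_excitation(f0_frames, sample_rate=24000, hop_size=240,
                    sine_amp=0.1, noise_std=0.003, seed=0):
    """Hypothetical helper: convert frame-level F0 (Hz, 0 = unvoiced)
    into a sample-level excitation signal.

    Voiced frames produce a sine at the frame's F0 plus weak noise;
    unvoiced frames produce noise only. All defaults are assumptions.
    """
    rng = random.Random(seed)
    two_pi = 2.0 * math.pi
    phase = 0.0
    samples = []
    for f0 in f0_frames:
        # Hold each frame's F0 constant for hop_size samples.
        for _ in range(hop_size):
            if f0 > 0.0:
                # Accumulate phase so the sine stays continuous across frames.
                phase = (phase + two_pi * f0 / sample_rate) % two_pi
                value = sine_amp * math.sin(phase) + rng.gauss(0.0, noise_std)
            else:
                # Unvoiced: Gaussian noise with std of one third the sine amplitude.
                value = rng.gauss(0.0, sine_amp / 3.0)
            samples.append(value)
    return samples


# Two voiced frames at 220 Hz followed by one unvoiced frame.
excitation = sine_excitation([220.0, 220.0, 0.0])
print(len(excitation))
```

A signal of this kind carries pitch as an explicit periodic waveform rather than as a scalar per frame, which is the intuition behind using excitation signals as the F0 representation. How SiFiSinger actually constructs and conditions on its excitation is described in the full paper, not in this summary.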

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing