SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model
Jianwei Cui, Yu Gu, Chao Weng, Jie Zhang, Liping Chen, Lirong Dai

TL;DR
SiFiSinger is a high-fidelity end-to-end singing voice synthesis (SVS) system built on a source-filter model. It uses mel-cepstrum features and source excitation signals to improve pitch accuracy and synthesis quality, with results reported on the Opencpop dataset.
Contribution
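The paper introduces an SVS system that combines source-filter modeling with two strategies: mel-cepstrum (mcep) features that decouple the mel-spectrogram from F0 information, and source excitation signals that represent F0, with the aim of improving pitch accuracy and synthesis quality.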
Findings
Enhanced synthesis quality demonstrated on the Opencpop dataset
Improved pitch accuracy through source excitation signals
Effective decoupling of mel-spectrogram and F0 features
Abstract
This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism that directly translates lyrical and melodic cues into expressive and high-fidelity human-like singing. Similarly to VISinger 2, the proposed system also utilizes training paradigms evolved from VITS and incorporates elements like the fundamental pitch (F0) predictor and waveform generation decoder. To address the issue that the coupling of mel-spectrogram features with F0 information may introduce errors during F0 prediction, we consider two strategies. Firstly, we leverage mel-cepstrum (mcep) features to decouple the intertwined mel-spectrogram and F0 characteristics. Secondly, inspired by the neural source-filter models, we introduce source excitation signals as the representation of F0 in the SVS system, aiming to capture pitch nuances more accurately. Meanwhile,…
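To make the idea of a source excitation signal concrete, here is a minimal sketch of how a frame-level F0 contour could be turned into a sample-level sine excitation, in the spirit of neural source-filter models. It is not the paper's implementation: the function name `sine_excitation`, the sample rate, hop length, amplitude, and noise levels are all illustrative assumptions, and the F0 contour is assumed to use 0 for unvoiced frames.

```python
import numpy as np

def sine_excitation(f0_frames, sample_rate=24000, hop_length=240,
                    amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sine-based excitation from a frame-level F0 contour (Hz).

    Hypothetical helper for illustration only; all defaults are assumptions.
    Frames with F0 == 0 are treated as unvoiced and filled with noise.
    """
    f0_frames = np.asarray(f0_frames, dtype=np.float64)
    # Upsample frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop_length)
    voiced = f0 > 0
    # Instantaneous phase is the running sum of per-sample frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    rng = np.random.default_rng(seed)
    # Voiced regions: sine at the F0 plus a small amount of noise.
    sine = amplitude * np.sin(phase) + rng.normal(0.0, noise_std, size=f0.shape)
    # Unvoiced regions: noise only, at a level comparable to the sine.
    unvoiced = rng.normal(0.0, amplitude / 3.0, size=f0.shape)
    return np.where(voiced, sine, unvoiced)

# Example: 0.1 s of silence-like noise followed by 1 s at 200 Hz.
f0_contour = [0.0] * 10 + [200.0] * 100
excitation = sine_excitation(f0_contour)
print(excitation.shape)
```

Because the pitch is carried by the phase of an explicit waveform rather than by a scalar feature, a downstream decoder receives F0 in a form that is directly tied to the periodicity of the output audio, which is the motivation the abstract gives for using excitation signals.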
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
