Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder
Runxuan Yang, Kai Li, Guo Chen, Xiaolin Hu

TL;DR
This paper introduces a novel method combining explicit spectrogram estimation with a redesigned vocoder to significantly improve the realism of synthesized singing voices, especially in high-frequency components, making synthetic audio nearly indistinguishable from real recordings.
Contribution
It proposes an integrated approach using a diffusion-based spectrogram estimator and a specialized vocoder to enhance spectrogram fidelity in singing voice synthesis.
Findings
Produced high-fidelity spectrograms that are hard to distinguish from real recordings
Maintained high audio quality with improved realism in both objective and subjective evaluations
Overcame limitations of current vocoding techniques, especially with respect to fake spectrogram detection
Abstract
This paper addresses the challenge of enhancing the realism of vocoder-generated singing voice audio by mitigating the distinguishable disparities between synthetic and real-life recordings, particularly in high-frequency spectrogram components. Our proposed approach combines two innovations: an explicit linear spectrogram estimation step using a denoising diffusion process with a DiT-based neural network architecture optimized for time-frequency data, and a redesigned vocoder, based on Vocos, that specializes in handling large linear spectrograms with increased frequency bins. This integrated method can produce audio with high-fidelity spectrograms that are challenging for both human listeners and machine classifiers to differentiate from authentic recordings. Objective and subjective evaluations demonstrate that our streamlined approach maintains high audio quality while achieving this realism.…
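To make "large linear spectrograms with increased frequency bins" concrete, the following is a minimal NumPy sketch of how the number of linear frequency bins grows with the FFT size. It is an illustration only, not the authors' estimator or vocoder: the sample rate, FFT sizes, hop lengths, and the helper name linear_spectrogram are all assumptions chosen for the example.

```python
import numpy as np

def linear_spectrogram(signal, n_fft, hop):
    """Hypothetical helper: magnitude STFT with a Hann window.

    Returns an array of shape (n_fft // 2 + 1, n_frames), i.e. one row
    per linearly spaced frequency bin from 0 Hz up to Nyquist.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 44100  # assumed value, not taken from the paper
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

small = linear_spectrogram(signal, n_fft=1024, hop=256)
large = linear_spectrogram(signal, n_fft=2048, hop=512)

# Doubling n_fft roughly doubles the number of frequency bins the
# downstream model has to handle, while halving the bin width in Hz.
print(small.shape[0], large.shape[0])

# Frequency (in Hz) of the strongest bin in the larger spectrogram.
peak_bin = int(np.argmax(large.mean(axis=1)))
peak_hz = peak_bin * sample_rate / 2048
```

Unlike a mel spectrogram, which pools these bins into a much smaller number of perceptually spaced bands, a linear spectrogram keeps every bin up to Nyquist, which is where the high-frequency detail discussed in the abstract lives.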
Taxonomy
Topics: Music and Audio Processing · Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis
