Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder

Runxuan Yang; Kai Li; Guo Chen; Xiaolin Hu

arXiv:2508.01796 · cs.SD · August 5, 2025

TL;DR

This paper combines an explicit linear spectrogram estimation step with a redesigned vocoder to improve the realism of synthesized singing voices, especially in high-frequency components, producing audio that is difficult for both human listeners and machine classifiers to tell apart from real recordings.

Contribution

The paper proposes an integrated approach that pairs a diffusion-based linear spectrogram estimator with a specialized vocoder to improve spectrogram fidelity in singing voice synthesis.

Findings

01. Produced high-fidelity spectrograms that are hard to distinguish from real recordings.

02. Maintained high audio quality with improved realism in both objective and subjective evaluations.

03. Made progress on the limitations of current vocoding techniques, especially with respect to fake-spectrogram detection.

Abstract

This paper addresses the challenge of enhancing the realism of vocoder-generated singing voice audio by mitigating the distinguishable disparities between synthetic and real-life recordings, particularly in high-frequency spectrogram components. Our proposed approach combines two innovations: an explicit linear spectrogram estimation step that uses a denoising diffusion process with a DiT-based neural network architecture optimized for time-frequency data, and a redesigned vocoder, based on Vocos, that specializes in handling large linear spectrograms with an increased number of frequency bins. This integrated method can produce audio with high-fidelity spectrograms that are challenging for both human listeners and machine classifiers to differentiate from authentic recordings. Objective and subjective evaluations demonstrate that our streamlined approach maintains high audio quality while achieving this realism.…
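To give a sense of what "large linear spectrograms with increased frequency bins" means in practice, the following minimal sketch computes a linear magnitude spectrogram with NumPy and reports its bin count. It is not the paper's implementation: the sample rate, FFT size, hop length, and the helper name `linear_spectrogram` are all assumptions chosen for illustration. A linear spectrogram has `n_fft // 2 + 1` bins per frame, whereas mel spectrograms commonly fed to vocoders typically use on the order of 80 to 128 bands, so the vocoder input grows by roughly an order of magnitude.

```python
import numpy as np

def linear_spectrogram(signal, n_fft=2048, hop=512):
    """Hypothetical helper: magnitude STFT with a Hann window.

    Returns an array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))

# Assumed settings, not taken from the paper.
sample_rate = 44100
n_fft = 2048

# One second of a 440 Hz test tone stands in for real singing audio.
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = linear_spectrogram(tone, n_fft=n_fft, hop=512)
n_bins = spec.shape[1]

# Frequency of the strongest bin, averaged over frames.
bin_width_hz = sample_rate / n_fft
peak_hz = spec.mean(axis=0).argmax() * bin_width_hz

print(f"frames={spec.shape[0]}, linear bins={n_bins}, peak near {peak_hz:.1f} Hz")
```

Each linear bin here spans about 21.5 Hz uniformly up to the Nyquist frequency, so high-frequency detail is kept at the same resolution as low-frequency detail, which is the region the abstract singles out as distinguishing synthetic from real recordings.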

Taxonomy

Topics: Music and Audio Processing · Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis