SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture
Kehan Sui, Jinxu Xiang, Fang Jin

TL;DR
SmoothSinger is a novel conditional diffusion model for singing voice synthesis that directly refines audio in a unified framework, leveraging a dual-branch architecture and multi-resolution design to produce more natural and expressive singing voices.
Contribution
It introduces a reference-guided dual-branch diffusion architecture with a multi-resolution U-Net to improve singing voice synthesis quality without relying on vocoders.
Findings
Achieves state-of-the-art results on the Opencpop dataset
Reduces artifacts and enhances naturalness in synthesized singing voices
Effectively captures pitch and spectral dependencies
Abstract
Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high-quality, natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any…
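To make the idea of refining low-quality audio with a conditional diffusion model more concrete, here is a minimal sketch of how one training example could be assembled. It assumes the standard DDPM forward (noising) process with a linear beta schedule, which the abstract does not confirm. All function names and parameters (`linear_beta_schedule`, `make_training_example`, `reference`, and so on) are hypothetical illustrations and not the authors' implementation. The denoising network itself (the dual-branch, multi-resolution U-Net) is deliberately left out.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, standard DDPM choice)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(clean, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(clean, eps)]
    return noisy, eps


def make_training_example(clean, reference, t, alpha_bars, rng):
    """Pair the noised target with the low-quality reference as conditioning.

    Hypothetical layout: a denoiser would receive `noisy`, `reference`,
    and `t`, and be trained to predict `target_noise`.
    """
    noisy, eps = add_noise(clean, t, alpha_bars, rng)
    return {"noisy": noisy, "reference": reference, "t": t,
            "target_noise": eps}


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(50))
    # Toy waveforms: a clean sine and a crudely quantized copy standing in
    # for the low-quality reference audio.
    clean = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
    reference = [round(x * 4) / 4 for x in clean]
    example = make_training_example(clean, reference, 25, alpha_bars, rng)
    print(len(example["noisy"]), example["t"])
```

In this sketch the reference audio is passed through unchanged while only the target is noised, which is one common way to condition a diffusion model. How SmoothSinger actually combines the two branches is not specified in the text above.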
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
