SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture

Kehan Sui; Jinxu Xiang; Fang Jin

arXiv:2506.21478·cs.SD·June 27, 2025

Open Access

TL;DR

SmoothSinger is a novel conditional diffusion model for singing voice synthesis that directly refines audio in a unified framework, leveraging a dual-branch architecture and multi-resolution design to produce more natural and expressive singing voices.

Contribution

The paper introduces a reference-guided dual-branch diffusion architecture with a multi-resolution U-Net to improve singing voice synthesis quality without relying on vocoders.

Findings

01 Achieves state-of-the-art results on the Opencpop dataset

02 Reduces artifacts and enhances naturalness in synthesized singing voices

03 Effectively captures pitch and spectral dependencies

Abstract

Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high-quality, natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion