ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Yulin Song; Guorui Sang; Jing Yu; Chuangbai Xiao

arXiv:2410.15342·cs.SD·March 10, 2025

TL;DR

ConSinger is a novel singing voice synthesis method that leverages the consistency model to produce high-fidelity singing voices efficiently with minimal inference steps, balancing quality and speed.

Contribution

This paper introduces ConSinger, a new SVS approach using the consistency model to achieve high-quality synthesis with fewer inference steps, improving efficiency over diffusion-based methods.

Findings

1. ConSinger produces high-fidelity singing voices comparable to baseline models.

2. It achieves faster inference with minimal steps while maintaining quality.

3. Experimental results demonstrate competitive speed and quality trade-offs.

Abstract

A singing voice synthesis (SVS) system is expected to generate a high-fidelity singing voice from a given music score (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, they trade inference speed for high-quality sample generation, which limits their application scenarios. To obtain high-quality synthetic singing voices more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model, which achieves high-fidelity synthesis in minimal steps. The model is trained with a consistency constraint, and generation quality improves greatly at the cost of a small reduction in inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at…
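
The abstract does not spell out how the consistency constraint is parameterized. As a rough illustration only, the sketch below follows the general consistency-model formulation (Song et al., 2023), in which the output f(x_t, t) = c_skip(t) * x_t + c_out(t) * F(x_t, t) is built so that f returns its input unchanged at the smallest noise level. The constants `SIGMA_DATA` and `EPSILON`, the helper names, and `toy_network` are assumptions for illustration, not ConSinger's actual settings or architecture.

```python
import math

# Assumed constants for illustration (not taken from the ConSinger paper).
SIGMA_DATA = 0.5   # assumed standard deviation of the clean data
EPSILON = 0.002    # assumed smallest noise level on the schedule


def c_skip(t):
    """Weight on the noisy input; equals 1 at t = EPSILON."""
    return SIGMA_DATA ** 2 / ((t - EPSILON) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Weight on the network output; equals 0 at t = EPSILON."""
    return SIGMA_DATA * (t - EPSILON) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def consistency_fn(network, x_t, t):
    """f(x_t, t) = c_skip(t) * x_t + c_out(t) * network(x_t, t)."""
    raw = network(x_t, t)
    return [c_skip(t) * x + c_out(t) * r for x, r in zip(x_t, raw)]


def toy_network(x_t, t):
    # Hypothetical stand-in for a trained denoiser; a real SVS model would
    # also condition on the music score (lyrics, duration, pitch).
    return [0.9 * x for x in x_t]


# Boundary condition: at the smallest noise level the input is returned as-is.
x = [0.3, -1.2, 0.7]
at_boundary = consistency_fn(toy_network, x, EPSILON)

# Single-step generation: map a heavily noised sample directly to an estimate.
noisy = [2.0, -5.0, 1.5]
one_step = consistency_fn(toy_network, noisy, 80.0)
```

Because the boundary condition is enforced by construction, training only has to make outputs at neighbouring noise levels agree, which is what permits sampling in one or a few steps rather than the many iterations of a standard diffusion sampler.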

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion