ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao

TL;DR
ConSinger is a novel singing voice synthesis method that leverages the consistency model to produce high-fidelity singing voices efficiently with minimal inference steps, balancing quality and speed.
Contribution
This paper introduces ConSinger, a new SVS approach using the consistency model to achieve high-quality synthesis with fewer inference steps, improving efficiency over diffusion-based methods.
Findings
ConSinger produces high-fidelity singing voices comparable to baseline models.
Because it needs only a few inference steps, it generates faster than multi-step diffusion-based methods while maintaining quality.
Experimental results demonstrate competitive speed and quality trade-offs.
Abstract
A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voices from a given music score (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, they trade inference speed for sample quality, which limits their application scenarios. To obtain high-quality synthetic singing voices more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at…
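To make "consistency constraint" and "minimal steps" concrete, the sketch below shows generic multistep consistency sampling on a toy problem. It is not ConSinger's implementation: the abstract gives no architecture or hyperparameters, so `toy_consistency_fn`, `multistep_sample`, `EPS`, `T_MAX`, and `MU` are all hypothetical names and values chosen for illustration. The toy assumes every clean sample equals `MU`, in which case the consistency function has a closed form; a real SVS model would replace it with a trained network conditioned on the music score.

```python
import numpy as np

EPS = 0.002    # smallest noise level (assumed value, for illustration only)
T_MAX = 80.0   # largest noise level (assumed value, for illustration only)
MU = np.array([1.0, -2.0, 0.5])  # toy "clean sample", e.g. one spectrogram frame


def toy_consistency_fn(x, t):
    """Closed-form consistency function when all data equals MU.

    For point-mass data, probability-flow ODE trajectories are straight lines
    through MU, so every point on a trajectory maps to MU + (EPS / t) * (x - MU).
    A real model would be a trained, score-conditioned neural network.
    """
    return MU + (EPS / t) * (x - MU)


def multistep_sample(consistency_fn, shape, timesteps, rng):
    """Generic multistep consistency sampling (not ConSinger-specific).

    timesteps: decreasing noise levels, starting at the largest one.
    A single entry corresponds to one-step generation.
    """
    # Start from pure noise at the largest noise level and denoise in one call.
    x_noisy = rng.standard_normal(shape) * timesteps[0]
    x = consistency_fn(x_noisy, timesteps[0])
    # Optional refinement: re-noise to a smaller level, then denoise again.
    for t in timesteps[1:]:
        z = rng.standard_normal(shape)
        x_noisy = x + np.sqrt(t**2 - EPS**2) * z
        x = consistency_fn(x_noisy, t)
    return x


rng = np.random.default_rng(0)
one_step = multistep_sample(toy_consistency_fn, MU.shape, [T_MAX], rng)
two_step = multistep_sample(toy_consistency_fn, MU.shape, [T_MAX, 10.0], rng)
```

The consistency constraint is the property that any two points on the same trajectory map to the same output, and that the function is the identity at the smallest noise level. Each extra entry in `timesteps` costs one more function call, which is the speed and quality trade-off the abstract refers to.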
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
