Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao

TL;DR
MelodyLM is a text-controlled singing voice synthesis model that generates high-quality accompanied songs with minimal user input and maximum control, using a language model for the vocal track and a latent diffusion model for the accompaniment.
Contribution
It introduces MelodyLM, the first TTSong system that synthesizes songs with fully text-controlled melodies and minimal input requirements, advancing flexibility and quality in singing voice synthesis.
Findings
Reports superior performance on both objective and subjective metrics.
Allows full control through textual prompts or MIDI input.
Requires only lyrics and a reference voice for synthesis.
Abstract
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full…
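The staged design described in the abstract (MIDI as an explicit intermediate, then vocals, then accompaniment) can be sketched as a simple dataflow. This is a toy illustration only, not the authors' implementation: every name here (`Note`, `generate_midi`, `generate_vocal`, `generate_accompaniment`, `synthesize_song`, `SAMPLE_RATE`) is hypothetical, and the stub bodies return placeholder data where the real system would run a language model or a latent diffusion model.

```python
from dataclasses import dataclass
from typing import List, Optional

SAMPLE_RATE = 100  # toy value (assumption), chosen only to keep the example small


@dataclass
class Note:
    pitch: int       # MIDI pitch number
    duration: float  # length in seconds


def generate_midi(lyrics: str,
                  text_prompt: Optional[str] = None,
                  midi: Optional[List[Note]] = None) -> List[Note]:
    """Stage 1 (stub): melody as MIDI. Uses the user's MIDI if supplied."""
    if midi is not None:
        return midi
    # Placeholder melody: one fixed note per word of the lyrics.
    # text_prompt is ignored by this stub.
    return [Note(pitch=60, duration=0.5) for _ in lyrics.split()]


def generate_vocal(lyrics: str, notes: List[Note],
                   reference_voice: List[float]) -> List[float]:
    """Stage 2 (stub): vocal track conditioned on lyrics, MIDI, and a voice prompt."""
    n_samples = int(sum(note.duration for note in notes) * SAMPLE_RATE)
    return [0.0] * n_samples  # silence standing in for generated audio


def generate_accompaniment(vocal: List[float], notes: List[Note]) -> List[float]:
    """Stage 3 (stub): accompaniment with the same length as the vocal track."""
    return [0.0] * len(vocal)


def synthesize_song(lyrics: str, reference_voice: List[float],
                    text_prompt: Optional[str] = None,
                    midi: Optional[List[Note]] = None) -> List[float]:
    """Chain the three stages and mix vocal and accompaniment sample by sample."""
    notes = generate_midi(lyrics, text_prompt, midi)
    vocal = generate_vocal(lyrics, notes, reference_voice)
    accompaniment = generate_accompaniment(vocal, notes)
    return [v + a for v, a in zip(vocal, accompaniment)]
```

In this sketch the minimal-input case is a call with only `lyrics` and `reference_voice`; passing `text_prompt` or `midi` corresponds to the additional control channels the summary mentions.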
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Latent Diffusion Model · Diffusion
