Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation
Ziqian Ning, Shuai Wang, Yuepeng Jiang, Jixun Yao, Lei He, Shifeng Pan, Jie Ding, Lei Xie

TL;DR
This paper introduces Freestyler, a system that generates rap vocals directly from lyrics and an accompaniment track using a language model, a conditional flow matching model, and a neural vocoder. It is supported by the new RapBank dataset and is reported to produce high-quality, rhythmically aligned output.
Contribution
The paper presents the first system for rap vocal generation from lyrics and accompaniment, combining language models, flow matching, and neural vocoders, along with a new rap dataset.
Findings
High-quality, natural-sounding rap vocal generation
Strong stylistic and rhythmic alignment with beats
Effective zero-shot timbre control
Abstract
Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language model-based token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously…
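The three stages named in the abstract (token generation, conditional flow matching, vocoding) can be pictured as a simple chain of functions. The sketch below is only a structural illustration, not the Freestyler implementation: every name, constant, and stub body is a hypothetical placeholder, and two details are assumptions that the abstract does not state, namely that one token is produced per accompaniment frame and that the timbre prompt is consumed by the second stage.

```python
import random
from dataclasses import dataclass
from typing import List

# All names, sizes, and stub behaviours here are hypothetical placeholders.
N_MELS = 80          # assumed number of mel-spectrogram bins
HOP_SAMPLES = 256    # assumed audio samples per spectrogram frame
VOCAB_SIZE = 1024    # assumed size of the discrete token vocabulary


@dataclass
class RapRequest:
    lyrics: str
    accompaniment: List[float]   # accompaniment waveform samples
    timbre_prompt: List[float]   # short reference audio (about 3 s per the abstract)


def generate_tokens(lyrics: str, accompaniment: List[float], seed: int = 0) -> List[int]:
    """Stage 1 stub. A language model would predict discrete tokens from the
    lyrics and accompaniment; this stub draws one random token per frame."""
    rng = random.Random(seed)
    n_frames = max(1, len(accompaniment) // HOP_SAMPLES)
    return [rng.randrange(VOCAB_SIZE) for _ in range(n_frames)]


def tokens_to_spectrogram(tokens: List[int], timbre_prompt: List[float]) -> List[List[float]]:
    """Stage 2 stub. A conditional flow matching model would map tokens to a
    spectrogram (assumed here to also take the timbre prompt); this stub
    returns an all-zero spectrogram with one frame per token."""
    return [[0.0] * N_MELS for _ in tokens]


def vocode(spectrogram: List[List[float]]) -> List[float]:
    """Stage 3 stub. A neural vocoder would restore audio from the
    spectrogram; this stub returns silence of the matching length."""
    return [0.0] * (len(spectrogram) * HOP_SAMPLES)


def synthesize(request: RapRequest) -> List[float]:
    """Chain the three stages in the order the abstract describes."""
    tokens = generate_tokens(request.lyrics, request.accompaniment)
    spectrogram = tokens_to_spectrogram(tokens, request.timbre_prompt)
    return vocode(spectrogram)
```

Calling `synthesize(RapRequest("some lyrics", accompaniment_samples, prompt_samples))` returns a list of samples whose length follows the accompaniment. Replacing any stub with a real model would not change the data flow between stages.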
Taxonomy
Topics: Music Technology and Sound Studies · Speech and Audio Processing
