SegTune: Structured and Fine-Grained Control for Song Generation
Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

TL;DR
SegTune is a novel non-autoregressive framework that enables structured, fine-grained control over song generation, allowing segment-level customization and improved alignment with lyrics and musical attributes.
Contribution
The paper introduces three components: SegTune, a controllable song generation model with segment-level control; an LLM-based duration predictor; and a large-scale data pipeline for high-quality aligned songs.
Findings
SegTune outperforms baselines in controllability and coherence.
Segment-level control improves musical structure accuracy.
The LLM-based duration predictor enhances lyric-to-music alignment.
Abstract
Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively…
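The temporal broadcasting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `broadcast_segment_prompts`, the frame rate, the toy list-based embeddings, and the choice to combine global and segment prompts by element-wise addition are all assumptions made for illustration. The idea shown is only that each segment's prompt embedding is copied onto every frame inside that segment's time window, while the global prompt is present on every frame of the song.

```python
from typing import List, Sequence, Tuple

# (label, start_sec, end_sec); a hypothetical segment format for this sketch.
Segment = Tuple[str, float, float]


def broadcast_segment_prompts(
    segments: Sequence[Segment],
    segment_embs: Sequence[Sequence[float]],
    global_emb: Sequence[float],
    total_sec: float,
    frame_rate: float,
) -> List[List[float]]:
    """Return one conditioning vector per frame.

    Every frame starts from the global prompt embedding. Frames that fall
    inside a segment's [start, end) window additionally receive that
    segment's embedding (added element-wise, which is an assumption here).
    """
    n_frames = int(round(total_sec * frame_rate))
    # The global prompt covers the whole song.
    frames = [list(global_emb) for _ in range(n_frames)]
    for (_label, start, end), emb in zip(segments, segment_embs):
        lo = max(0, int(round(start * frame_rate)))
        hi = min(n_frames, int(round(end * frame_rate)))
        combined = [g + s for g, s in zip(global_emb, emb)]
        # Broadcast the segment prompt across its time window.
        for t in range(lo, hi):
            frames[t] = list(combined)
    return frames
```

For example, with a verse spanning 0 to 2 seconds and a chorus spanning 2 to 4 seconds in a 5-second clip, the verse embedding lands on the frames of the first window, the chorus embedding on the second, and the trailing frames carry only the global prompt. In this sketch the segment boundaries are given as inputs; in the paper they would come from the duration predictor.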
Taxonomy
Topics: Artificial Intelligence in Games · Music and Audio Processing · Music Technology and Sound Studies
