Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation
Longshen Ou, Xichu Ma, Ye Wang

TL;DR
This paper presents a joint learning approach for melody-to-lyric generation that improves singability by incorporating formatting and prosodic patterns, leading to higher quality and structural adherence in generated lyrics.
Contribution
It introduces a training framework that combines general-domain pretraining, a self-supervised length-awareness stage trained on text-only lyrics, and auxiliary supervision objectives informed by musicological findings on melody-lyric relationships.
Findings
3.8% absolute improvement in line-count adherence over naive fine-tuning
21.4% absolute improvement in syllable-count adherence over naive fine-tuning
42.2% and 74.2% relative gains in overall quality in human evaluation
Abstract
Despite progress in melody-to-lyric generation, a substantial singability gap remains between machine-generated lyrics and those written by human lyricists. In this work, we aim to narrow this gap by jointly learning both wording and formatting for melody-to-lyric generation. After general-domain pretraining, our model acquires length awareness through a self-supervised stage trained on a large text-only lyric corpus. During supervised melody-to-lyric training, we introduce multiple auxiliary supervision objectives informed by musicological findings on melody-lyric relationships, encouraging the model to capture fine-grained prosodic and structural patterns. Compared with naive fine-tuning, our approach improves adherence to line-count and syllable-count requirements by 3.8% and 21.4% absolute, respectively, without degrading text quality. In human evaluation, it achieves 42.2% and…
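As a rough illustration of what line-count and syllable-count adherence could mean in practice, the sketch below scores a generated lyric against a per-line syllable target. It is not the paper's evaluation code: the function names (count_syllables, line_syllables, format_adherence) are hypothetical, and the vowel-group syllable counter is a crude assumed heuristic for English that miscounts words with silent vowels.

```python
import re
from typing import List, Tuple


def count_syllables(word: str) -> int:
    """Estimate syllables as the number of vowel runs (assumed heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def line_syllables(line: str) -> int:
    """Sum the estimated syllables of every word on one lyric line."""
    words = re.findall(r"[A-Za-z']+", line)
    return sum(count_syllables(w) for w in words)


def format_adherence(lyrics: str, target_syllables: List[int]) -> Tuple[bool, float]:
    """Return (line count matches, fraction of target lines with exact syllable count)."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    line_count_ok = len(lines) == len(target_syllables)
    # Lines beyond the shorter sequence are ignored by zip and count as misses.
    matches = sum(
        1 for ln, target in zip(lines, target_syllables)
        if line_syllables(ln) == target
    )
    syllable_acc = matches / len(target_syllables) if target_syllables else 0.0
    return line_count_ok, syllable_acc


if __name__ == "__main__":
    generated = "I see the sun\nIt falls on me"
    print(format_adherence(generated, [4, 4]))
    print(format_adherence(generated, [4, 5]))
```

Here the target list stands in for the format a melody would impose (one entry per melodic phrase). A real evaluation would likely use a pronunciation dictionary rather than vowel counting.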
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
