JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Renhang Liu; Chia-Yu Hung; Navonil Majumder; Taylor Gautreaux; Amir Ali Bagherzadeh; Chuan Li; Dorien Herremans; Soujanya Poria

arXiv:2507.20880·cs.SD·July 29, 2025



TL;DR

JAM is a flow-based song generator that offers fine-grained word-level control and aesthetic alignment, improving the quality and controllability of automatic song creation compared to existing models.

Contribution

This paper introduces JAM, the first flow-based model enabling word-level timing and duration control in song generation, with aesthetic refinement via preference optimization.

Findings

01 JAM outperforms existing lyrics-to-song models in music-specific attributes.

02 The model achieves fine-grained control over vocal timing and duration.

03 A new evaluation dataset, JAME, standardizes assessment of lyrics-to-song models.

Abstract

Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality, faithful audio that captures speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability that musicians often want in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing song generation with word-level timing and duration control, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human…
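To make "word-level timing and duration control" concrete, here is a minimal sketch of what such a control signal could look like as data. This is not JAM's actual input format. The `WordTiming` schema, the `validate` and `to_frames` helpers, and the frame rate used below are all hypothetical, chosen only to illustrate the idea of assigning each sung word an explicit start time and duration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordTiming:
    """One sung word with start/end times in seconds (hypothetical schema)."""
    word: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def validate(timings: List[WordTiming]) -> None:
    """Raise ValueError if a word has a non-positive duration
    or starts before the previous word has ended."""
    prev_end = 0.0
    for t in timings:
        if t.start < 0 or t.end <= t.start:
            raise ValueError(f"invalid time span for {t.word!r}")
        if t.start < prev_end:
            raise ValueError(f"{t.word!r} overlaps the previous word")
        prev_end = t.end


def to_frames(timings: List[WordTiming], frames_per_second: int) -> List[Tuple[str, int, int]]:
    """Map each word to a half-open [start_frame, end_frame) range
    at an assumed frame rate (the rate is illustrative only)."""
    return [
        (t.word, round(t.start * frames_per_second), round(t.end * frames_per_second))
        for t in timings
    ]


# Example: two words with explicit timing, converted at an assumed 10 frames/s.
lyrics = [WordTiming("hello", 0.5, 1.0), WordTiming("world", 1.5, 2.5)]
validate(lyrics)
frames = to_frames(lyrics, frames_per_second=10)
```

A model with word-level control would condition on something like these per-word spans, so that moving a word's `start` or stretching its `duration` changes when and for how long it is sung.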


Code & Models

Models

declare-lab/JAM-0.5 (model, 38 downloads, 36 likes)

Datasets

declare-lab/JAME (dataset, 13 downloads)
