Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation
Tongxi Wang, Yang Yu, Qing Wang, Junlang Qian

TL;DR
This paper introduces BACH, a symbolic score-based model for long song generation. The authors report that it improves controllability, perceptual quality, and efficiency over existing audio-based methods and sets new state-of-the-art results.
Contribution
BACH is the first model designed specifically for symbolic, human-editable song generation, addressing key limitations of previous audio-based approaches.
Findings
BACH is reported to achieve better perceptual quality and controllability than existing audio-based methods.
In the authors' experiments, it outperforms commercial solutions such as Suno.
It achieves these results with a small model size.
Abstract
Song generation is regarded as the most challenging problem in music AIGC; nonetheless, existing approaches have yet to fully overcome four persistent limitations: controllability, generalizability, perceptual quality, and duration. We argue that these shortcomings stem primarily from the prevailing paradigm of attempting to learn music theory directly from raw audio, a task that remains prohibitively difficult for current models. To address this, we present Bar-level AI Composing Helper (BACH), the first model explicitly designed for song generation through human-editable symbolic scores. BACH introduces a tokenization strategy and a symbolic generative procedure tailored to hierarchical song structure. Consequently, it achieves substantial gains in the efficiency, duration, and perceptual quality of song generation. Experiments demonstrate that BACH, with a small model size,…
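The abstract does not spell out BACH's tokenization, so the following is only a minimal sketch of what a bar-level, human-editable symbolic representation could look like. The `Note` fields, the `<bar>` marker, the `pos_`/`pitch_`/`len_` token names, and the helper functions are all hypothetical and are not taken from the paper. The sketch shows a score flattened into tokens with an explicit marker per bar, so that one bar can be rewritten without touching the others.

```python
from dataclasses import dataclass
from typing import List

BAR = "<bar>"  # hypothetical bar-boundary token, not the paper's vocabulary


@dataclass(frozen=True)
class Note:
    bar: int     # zero-based bar index
    step: int    # onset inside the bar, in sixteenth-note steps
    pitch: int   # MIDI pitch number
    length: int  # duration in sixteenth-note steps


def tokenize(notes: List[Note], n_bars: int) -> List[str]:
    """Flatten notes into a token stream with one BAR marker per bar."""
    tokens: List[str] = []
    for b in range(n_bars):
        tokens.append(BAR)
        in_bar = sorted(
            (n for n in notes if n.bar == b),
            key=lambda n: (n.step, n.pitch),
        )
        for n in in_bar:
            tokens += [f"pos_{n.step}", f"pitch_{n.pitch}", f"len_{n.length}"]
    return tokens


def split_bars(tokens: List[str]) -> List[List[str]]:
    """Group a token stream into per-bar lists (assumes it starts with BAR)."""
    bars: List[List[str]] = []
    for t in tokens:
        if t == BAR:
            bars.append([])
        else:
            bars[-1].append(t)
    return bars


def replace_bar(tokens: List[str], index: int, new_bar: List[str]) -> List[str]:
    """Return a new stream where only the bar at `index` is rewritten."""
    bars = split_bars(tokens)
    bars[index] = list(new_bar)
    out: List[str] = []
    for bar in bars:
        out.append(BAR)
        out.extend(bar)
    return out


melody = [Note(0, 0, 60, 4), Note(0, 4, 62, 4), Note(1, 0, 64, 8)]
tokens = tokenize(melody, n_bars=2)
edited = replace_bar(tokens, 1, ["pos_0", "pitch_67", "len_8"])
```

In this toy layout an edit to bar 1 leaves the tokens of bar 0 unchanged. That is the kind of local, human-driven control that a symbolic score allows and that raw audio does not expose directly.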
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
