TL;DR
This paper introduces MIDI-SAG (MIDI-informed singing accompaniment generation), a score-to-song system that uses symbolic timing and chord information derived from the vocal MIDI to keep long-form accompaniment coherent while leaving authorship of the core melody with the songwriter.
Contribution
MIDI-SAG is a novel framework that incorporates symbolic timing and structure planning for stable, long-form, score-to-song generation, trained efficiently with pre-trained modules.
Findings
MIDI-SAG can generate coherent long-form singing accompaniments across both vocal and non-vocal sections, such as intros and bridges.
The system effectively integrates MIDI and chord data for structured music synthesis.
Pre-trained modules enable data-efficient training on a single GPU.
Abstract
While end-to-end lyrics-to-song models offer convenience for casual users, professional songwriters require score-to-song systems that allow them to retain authorship over the core melody. However, existing score-to-song methods are limited to short-form snippets and fail to maintain coherence in long-form generation, particularly during vocal-silent sections like intros and bridges. To address this long-form bottleneck, we propose MIDI-informed singing accompaniment generation (MIDI-SAG). Unlike conventional audio-only models, MIDI-SAG utilizes symbolic timing and chord information derived from the vocal MIDI to provide a stable musical roadmap. By incorporating structure planning, which defines temporal boundaries and semantic labels, our framework facilitates consistent generation across both vocal and non-vocal sections. We demonstrate the feasibility of this compositional pipeline…
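The abstract describes structure planning only as defining temporal boundaries and semantic labels. As a rough illustration of what such a plan could look like as data, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Section` type, its field names, the label vocabulary, the timings, and the helper functions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Section:
    """One entry in a hypothetical structure plan (not the paper's format)."""
    label: str        # semantic label, e.g. "intro" or "verse" (assumed vocabulary)
    start: float      # section start time in seconds
    end: float        # section end time in seconds
    has_vocals: bool  # False for vocal-silent sections such as intros and bridges


def validate_plan(plan: List[Section]) -> None:
    """Raise ValueError unless sections have positive length and are contiguous."""
    for sec in plan:
        if sec.end <= sec.start:
            raise ValueError(f"section {sec.label!r} has non-positive length")
    for prev, cur in zip(plan, plan[1:]):
        if prev.end != cur.start:
            raise ValueError(f"gap or overlap between {prev.label!r} and {cur.label!r}")


def vocal_silent_spans(plan: List[Section]) -> List[Tuple[float, float]]:
    """Return (start, end) spans where no vocal MIDI is expected."""
    return [(sec.start, sec.end) for sec in plan if not sec.has_vocals]


# Illustrative plan with made-up timings.
plan = [
    Section("intro", 0.0, 8.0, has_vocals=False),
    Section("verse", 8.0, 40.0, has_vocals=True),
    Section("bridge", 40.0, 56.0, has_vocals=False),
    Section("chorus", 56.0, 88.0, has_vocals=True),
]

validate_plan(plan)
silent = vocal_silent_spans(plan)
```

In this sketch, the vocal-silent spans are the regions where an accompaniment generator would have no vocal signal to follow and would need to rely on the plan itself.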
