Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment
Abhinaba Roy, Geeta Puri, Dorien Herremans

TL;DR
Text2midi-InferAlign enhances symbolic music generation by optimizing alignment-based objectives during inference, resulting in more coherent and caption-consistent compositions without additional training.
Contribution
It introduces a novel inference-time alignment technique that improves music-text consistency in symbolic music generation models.
Findings
Significant improvements in objective evaluation metrics.
Enhanced subjective quality and coherence of generated music.
Extension of existing models without additional training.
Abstract
We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objective scores: a text-audio consistency score that measures rhythmic alignment between the generated music and the original text caption, and a harmonic consistency score that penalizes generated music containing notes inconsistent with the key. By optimizing these alignment-based objectives during the generation process, our model produces symbolic music that is more closely tied to the input captions, thereby improving the overall quality and coherence of the generated compositions. Our approach can extend any existing autoregressive model without requiring further…
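The idea of a harmonic consistency score can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the restriction to major keys, and the "fraction of in-key notes" formula are all assumptions chosen for illustration. The second helper shows one simple way such a score could be used at inference time, by picking the best of several sampled candidates.

```python
from typing import Iterable, Sequence

# Semitone offsets of the major scale relative to the tonic (assumption:
# only major keys are handled in this sketch).
MAJOR_SCALE_STEPS = (0, 2, 4, 5, 7, 9, 11)


def harmonic_consistency(pitches: Iterable[int], tonic: int = 0) -> float:
    """Fraction of MIDI pitches whose pitch class lies in the major key on `tonic`.

    Hypothetical scoring rule: 1.0 means every note is in key, lower values
    mean more out-of-key notes. An empty sequence scores 0.0.
    """
    in_key = {(tonic + step) % 12 for step in MAJOR_SCALE_STEPS}
    notes = list(pitches)
    if not notes:
        return 0.0
    return sum((p % 12) in in_key for p in notes) / len(notes)


def pick_most_consistent(candidates: Sequence[Sequence[int]], tonic: int = 0) -> int:
    """Index of the candidate pitch sequence with the highest score (hypothetical helper)."""
    return max(
        range(len(candidates)),
        key=lambda i: harmonic_consistency(candidates[i], tonic),
    )


# Toy candidates as MIDI pitch numbers (60 = middle C).
c_major_phrase = [60, 62, 64, 65, 67]  # C D E F G
chromatic_phrase = [60, 61, 62, 63]    # C C# D D#

best = pick_most_consistent([chromatic_phrase, c_major_phrase], tonic=0)
```

In this toy setup the C major phrase is selected over the chromatic one. A real system would combine a score like this with the text-audio consistency score and apply it to candidates produced by the autoregressive model during decoding.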
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
