Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

Abhinaba Roy; Geeta Puri; Dorien Herremans

arXiv:2505.12669 · cs.SD · May 20, 2025

TL;DR

Text2midi-InferAlign enhances symbolic music generation by optimizing alignment-based objectives during inference, resulting in more coherent and caption-consistent compositions without additional training.

Contribution

It introduces a novel inference-time alignment technique that improves music-text consistency in symbolic music generation models.

Findings

01. Significant improvements in objective evaluation metrics.

02. Enhanced subjective quality and coherence of generated music.

03. Extension of existing models without additional training.

Abstract

We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objective scores: a text-audio consistency score that measures rhythmic alignment between the generated music and the original text caption, and a harmonic consistency score that penalizes generated music containing notes inconsistent with the key. By optimizing these alignment-based objectives during the generation process, our model produces symbolic music that is more closely tied to the input captions, thereby improving the overall quality and coherence of the generated compositions. Our approach can extend any existing autoregressive model without requiring further…
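
To make the idea of a harmonic consistency score concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the formula, so this version simply assumes the score is the fraction of notes whose pitch class lies in the scale of the target key. The names `harmonic_consistency`, `pick_most_consistent`, and `SCALES` are hypothetical, and choosing the best of several complete candidates is an assumed stand-in for however the method actually optimizes the reward during generation.

```python
# Illustrative sketch only; the scoring rule and all names are assumptions,
# not taken from the paper or its repository.

# Scale degrees as semitone offsets above the tonic.
SCALES = {
    "major": (0, 2, 4, 5, 7, 9, 11),
    "minor": (0, 2, 3, 5, 7, 8, 10),  # natural minor
}


def harmonic_consistency(pitches, tonic, mode="major"):
    """Return the fraction of MIDI pitches whose pitch class is in the key.

    pitches: iterable of MIDI note numbers (0-127)
    tonic:   pitch class of the key's tonic (0 = C, 1 = C#, ..., 11 = B)
    mode:    "major" or "minor"
    """
    pitches = list(pitches)
    if not pitches:
        return 0.0  # assumed convention: an empty sequence earns no reward
    in_key = {(tonic + step) % 12 for step in SCALES[mode]}
    hits = sum(1 for p in pitches if p % 12 in in_key)
    return hits / len(pitches)


def pick_most_consistent(candidates, tonic, mode="major"):
    """Return the candidate note sequence with the highest score.

    Assumed selection strategy: score several finished candidates and keep
    the best one. Ties go to the earliest candidate.
    """
    return max(candidates, key=lambda c: harmonic_consistency(c, tonic, mode))


if __name__ == "__main__":
    c_major_scale = [60, 62, 64, 65, 67, 69, 71, 72]
    chromatic_run = [60, 61, 62, 63]
    print(harmonic_consistency(c_major_scale, tonic=0))
    print(harmonic_consistency(chromatic_run, tonic=0))
    print(pick_most_consistent([chromatic_run, c_major_scale], tonic=0))
```

A score of this shape needs only the decoded note pitches and a key inferred from the caption, which is in keeping with the abstract's claim that the approach extends an existing autoregressive model without changing its weights.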

Code & Models

Repositories

amaai-lab/t2m-inferalign
PyTorch · Official

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis