Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting
Scott H. Hawley

TL;DR
This paper introduces a graphical interface for controlled music inpainting using an Hourglass Diffusion Transformer trained on MIDI images, enabling intuitive user control and enhanced note generation without latent space autoencoders.
Contribution
It presents a novel, efficient pixel-space diffusion model with graphical masking controls for music inpainting, improving user interaction and structural diversity in generated music.
Findings
Effective inpainting of melodies and accompaniments.
Enhanced note density and structural control through repainting.
Operates efficiently at longer context windows without autoencoders.
Abstract
Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be "repainted" with extra noise. The non-latent HDiTs linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Diffusion · Position-Wise Feed-Forward Layer · Adam
