Adaptable Symbolic Music Infilling with MIDI-RWKV
Christian Zhou-Zheng, Philippe Pasquier

TL;DR
This paper introduces MIDI-RWKV, a lightweight foundation model for adaptable, controllable, multi-track symbolic music infilling that supports efficient style adaptation and is suitable for use on edge devices, advancing computer-assisted composition.
Contribution
The paper presents MIDI-RWKV, a novel small foundation model enabling style adaptation and controllable music infilling, with efficient finetuning methods suitable for edge devices.
Findings
MIDI-RWKV achieves coherent multi-track music infilling.
Effective style adaptation is possible with very few samples.
Model and code are publicly released for community use.
Abstract
Existing work in automatic music generation has mostly focused on end-to-end systems that generate either entire compositions or continuations of pieces, which are difficult for composers to iterate on. The area of computer-assisted composition, where generative models integrate into existing creative workflows, remains comparatively underexplored. In this study, we address the tasks of model style adaptation and multi-track, long-context, and controllable symbolic music infilling to enhance the process of computer-assisted composition. We present MIDI-RWKV, a small foundation model based on the RWKV-7 linear architecture, to enable efficient and coherent musical cocreation on edge devices. We also demonstrate that MIDI-RWKV admits an effective method of finetuning its initial state for style adaptation in the very-low-sample regime. We evaluate MIDI-RWKV and its state tuning on several…
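To make the idea of finetuning only the initial state concrete, here is a minimal toy sketch. It is not RWKV-7 and not the authors' implementation: the scalar recurrence, the constants, and every function name are hypothetical, chosen only to show frozen weights alongside a trainable initial state.

```python
# Toy illustration of state tuning: all weights stay frozen and only the
# initial recurrent state h0 is trained. This is NOT RWKV-7; the scalar
# recurrence and every name here are hypothetical.

FROZEN_DECAY = 0.9   # frozen "weight": how much state carries over per step
FROZEN_GAIN = 0.5    # frozen "weight": how strongly each input enters the state


def run(h0, inputs):
    """Unroll h_t = decay * h_{t-1} + gain * x_t and return every h_t."""
    h, outputs = h0, []
    for x in inputs:
        h = FROZEN_DECAY * h + FROZEN_GAIN * x
        outputs.append(h)
    return outputs


def mse(h0, inputs, targets):
    """Mean squared error between the unrolled outputs and the targets."""
    outputs = run(h0, inputs)
    return sum((y - t) ** 2 for y, t in zip(outputs, targets)) / len(targets)


def tune_initial_state(inputs, targets, steps=200, lr=0.5):
    """Gradient descent on h0 alone; the frozen constants are never updated."""
    h0 = 0.0
    for _ in range(steps):
        outputs = run(h0, inputs)
        # d(output_t)/d(h0) = decay ** (t + 1), because h0 is scaled by the
        # decay once per step before it reaches output t (0-indexed).
        grad = sum(
            2.0 * (y - t) * FROZEN_DECAY ** (i + 1)
            for i, (y, t) in enumerate(zip(outputs, targets))
        ) / len(targets)
        h0 -= lr * grad
    return h0


inputs = [1.0, 0.0, -1.0, 0.5]
# Targets come from the same frozen recurrence started from a "style" state of 2.0.
targets = run(2.0, inputs)
tuned_h0 = tune_initial_state(inputs, targets)
```

In this sketch the only learned quantity is a single number, which is the sense in which state tuning is parameter-efficient: adapting to a new style means storing a new initial state rather than new weights.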
Peer Reviews
Decision: Submitted to ICLR 2026
- Simple, efficient backbone + controllability: Using RWKV-7 for long-context infilling with numerical/categorical controls is well-motivated and clearly described.
- State tuning is a neat adaptation mechanism, parameter-efficient and conceptually distinct from LoRA; the paper positions it clearly.
- Reasonable benchmarking on single-section and random infilling against CA and MIDI-GPT with transparent metrics.
- Human study present (28 participants) with statistical analysis, albeit limited i
- No accessible demo/audio page: For a music generation paper, the submission provides no public audio examples or interactive demo; only a claim that “code and weights [are] in the supplementary.” This makes it hard to independently judge musical quality, control fidelity, and usability.
- Evaluation scope undercuts the headline claim: State-tuning experiments and the listening test are confined to POP909 melody-only finetuning, not multi-track use, limiting evidence for the paper’s core “multi
1. The application of RWKV to MIDI LM is an important work for the music AI community.
2. The results on state tuning vs LoRA are pretty interesting.
3. Nice narrative and writing.
1. The major limitation is the novelty. The encoding scheme and model architectures are all well-defined, making it a good application paper but less ideal for an ICLR paper.
2. The setting of single-section infilling is limited compared to arbitrary masking.
3. Missing comparison against some types of symbolic infilling models, like diffusion-based [1].
4. The fine-tuning dataset is limited to only POP909. If only 99 songs are needed for training, there are many other genres that can be experimented on,
- High Quality and Empirical Thoroughness: The paper's greatest strength is its high-quality, thorough, and well-conducted empirical evaluation. The authors compare their 35M parameter model against several relevant baselines, including a larger one (CA, 54M), and demonstrate its effectiveness. The validation of the state tuning method is particularly strong, as it includes objective metrics, a subjective listening test, and stability analysis.
- Practical Significance: The paper tackles a signific
- Limited Conceptual Novelty: As detailed in the "Contribution" section, the paper's primary weakness is its lack of fundamental research novelty. The work is a clever and effective combination of existing components (RWKV-7 architecture, REMI+ encoding, Bar-Fill objective, and state tuning). This makes the paper feel more like a strong technical report or an application paper rather than a new research contribution for a conference on learning representations.
- Limited Ana
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
