MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

Jingjing Tang; Xin Wang; Zhe Zhang; Junichi Yamagishi; Geraint Wiggins; George Fazekas

arXiv:2507.08530 · cs.SD · July 14, 2025

Open Access

TL;DR

MIDI-VALLE is a neural codec language model that improves expressive piano performance synthesis by encoding MIDI and audio as discrete tokens, enabling better generalisation and synthesis quality across diverse styles and recordings.

Contribution

The paper introduces MIDI-VALLE, a novel neural codec model that conditions on reference audio and MIDI, encoding performances as discrete tokens for enhanced synthesis and generalisation.

Findings

01 Achieves over 75% lower Fréchet Audio Distance than the baseline.

02 Receives significantly more votes in listening tests, indicating higher perceived quality.

03 Demonstrates superior performance across diverse datasets.
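
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how a Fréchet distance between two sets of audio embeddings is commonly computed: fit a Gaussian to each set, then compare the means and covariances. This page does not say which embedding model or evaluation code the authors used, so the function names `frechet_distance` and `relative_reduction` are hypothetical, and the random arrays are placeholders for real embeddings, not data from the paper.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (n_samples, dim). In practice these
    would come from a pretrained audio embedding model (an assumption here;
    the page does not name one).
    """
    mu_r = emb_ref.mean(axis=0)
    mu_g = emb_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(emb_ref, rowvar=False))
    cov_g = np.atleast_2d(np.cov(emb_gen, rowvar=False))

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices (tiny negative values are numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 4))        # placeholder "real" embeddings
    generated = reference + rng.normal(scale=0.1, size=(200, 4))
    print(frechet_distance(reference, generated))
```

Under this reading, "over 75% lower" means `relative_reduction(baseline_fad, model_fad)` exceeds 0.75, where both arguments are distances measured against the same reference set.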

Abstract

Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as…

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Neuroscience and Music Perception