Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li

TL;DR
Pianist Transformer leverages self-supervised pre-training and a scalable architecture to significantly improve expressive piano performance rendering, achieving human-like quality without relying on small labeled datasets.
Contribution
It introduces a unified MIDI representation, an efficient asymmetric architecture, and a large-scale self-supervised pre-training pipeline for expressive music performance.
Findings
Achieves state-of-the-art objective metrics
Attains human-level subjective ratings
Enables longer context modeling and faster inference
Abstract
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets. This limits scaling of both data volume and model size, even though vast amounts of unlabeled music are available, as is the case for data in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and…
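To make the idea of a unified MIDI representation more concrete, here is a minimal sketch; it is not the paper's tokenizer. It assumes each note becomes four tokens (pitch, inter-onset interval, duration, velocity) and that times are quantized on a 10 ms grid. The token names, the grid size, the cap, and the `Note`, `quantize`, and `tokenize` helpers are all hypothetical. The relative inter-onset timing follows one reviewer's description of the representation.

```python
from dataclasses import dataclass
from typing import List

TIME_STEP = 0.01    # assumption: 10 ms quantization grid
MAX_TIME_BIN = 500  # assumption: cap long gaps/durations at 5 s


@dataclass
class Note:
    onset: float     # seconds from the start of the piece
    duration: float  # seconds
    pitch: int       # MIDI pitch, 0-127
    velocity: int    # MIDI velocity, 1-127


def quantize(seconds: float) -> int:
    """Map a time span in seconds to a bounded integer bin."""
    return min(MAX_TIME_BIN, max(0, round(seconds / TIME_STEP)))


def tokenize(notes: List[Note]) -> List[str]:
    """Emit four hypothetical tokens per note, using relative timing."""
    tokens: List[str] = []
    prev_onset = 0.0
    for note in sorted(notes, key=lambda n: (n.onset, n.pitch)):
        ioi = note.onset - prev_onset  # inter-onset interval
        tokens += [
            f"Pitch_{note.pitch}",
            f"IOI_{quantize(ioi)}",
            f"Dur_{quantize(note.duration)}",
            f"Vel_{note.velocity}",
        ]
        prev_onset = note.onset
    return tokens


example = [Note(0.0, 0.5, 60, 64), Note(0.5, 0.25, 64, 70)]
print(tokenize(example))
```

Because the timing tokens are relative rather than tied to bars or beats, the same function can be applied to a score-derived note list and to a recorded performance, which is the property that would let performance-only data be used for pre-training under these assumptions.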
Peer Reviews
Decision: Submitted to ICLR 2026
1. The tokenization and the overall pipeline are very well-designed and make sense in how they help with performance rendering tasks.
2. The experiment design is clear and evaluates the model from multiple perspectives.
3. The provided listening samples are very convincing, showing massive improvement compared to existing methods.
1. Opening with an experiment figure feels a bit off and does little to support the narrative. The results in Figure 1, especially the variant without pretraining, should be detailed in the experiments section with fuller analysis.
2. The paper could more intuitively explain token representations and model I/O at each stage. Small, concrete examples would improve readability.
3. The paper’s relevance to the ICLR community is under-articulated; as written, it reads more naturally for a computer-m
1. The incorporation of self-supervised learning on unpaired performance MIDI data is meaningful and could be extended to other symbolic music tasks beyond piano rendering.
2. The experimental section is thorough, particularly the design and analysis of the subjective evaluation.
1. Two of the major claimed contributions overlap with prior work. Existing tokenization schemes such as REMI [1], MIDI-Like [2], CPWord [3], and Octuple [4] already represent pitch, duration, and velocity as discrete tokens. Moreover, the proposed note-level compression in the encoder resembles the CPWord and Octuple designs, where each note is represented by compressing a fixed number of tokens (e.g., MusicBERT [4]). The authors should clarify how their unified representation provides new capab
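The note-level compression discussed in this review can be illustrated with a small sketch. It assumes four tokens per note and merges their embeddings by concatenation, in the spirit of the compound-token designs the reviewer cites. The function name, the group size, and the merge rule are assumptions for illustration and are not taken from the paper; a real model would typically follow the merge with a learned projection, which is omitted here.

```python
from typing import List

TOKENS_PER_NOTE = 4  # assumption: pitch, timing, duration, velocity


def compress_notes(
    token_embeddings: List[List[float]],
    tokens_per_note: int = TOKENS_PER_NOTE,
) -> List[List[float]]:
    """Merge each run of `tokens_per_note` embeddings into one note-level vector."""
    if len(token_embeddings) % tokens_per_note != 0:
        raise ValueError("sequence length must be a multiple of tokens_per_note")
    compressed: List[List[float]] = []
    for start in range(0, len(token_embeddings), tokens_per_note):
        group = token_embeddings[start:start + tokens_per_note]
        # Concatenate the group's vectors into a single longer vector.
        compressed.append([value for vector in group for value in vector])
    return compressed


# Eight toy 2-dimensional token embeddings, i.e. two notes.
toy = [[float(i), float(i) + 0.5] for i in range(8)]
notes = compress_notes(toy)
print(len(toy), "token positions ->", len(notes), "note positions")
```

Under these assumptions, a sequence of 4N token embeddings becomes N note-level vectors, so the encoder attends over a sequence four times shorter for the same musical span.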
This is a well-written paper, with the problem and goal clearly stated and the methodology well illustrated. A few strengths of the paper can be summarized as follows:
* **Unified Representation**: To make use of unaligned but more abundant data, the design of a unified score/performance representation is well motivated in the paper. In both cases, note durations are represented as relative (inter-onset) temporal units, which facilitates pre-training with performance-only data.
* **Comprehensive Treat
Despite its merits, the reviewer raises several points of concern; clarifying or addressing them could further strengthen the paper.
* **Baseline Comparisons**: While the paper includes two baseline models, both are relatively outdated. Incorporating more recent systems, such as (Borovik & Viro, 2023), would make the comparison more convincing and strengthen the evidence for the claimed performance improvements.
* **Limited Case Study Scale**: In Sections 4.4.2 and 4.4
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Neuroscience and Music Perception
