PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations
Julian Lenz, Anirudh Mani

TL;DR
Cadenza is a multi-stage generative framework that uses a novel MIDI encoding to produce expressive, stylistically related musical variations and ideas, combining a transformer-based VAE and a bidirectional encoder.
Contribution
It introduces PerTok, a MIDI encoding method that captures expressive details while significantly reducing sequence length and vocabulary size, and a two-stage generative framework for expressive music synthesis.
Findings
Cadenza matches state-of-the-art models in musical quality and expressiveness.
It can generate new, stylistically related musical ideas with novelty.
Objective and human evaluations confirm its versatility and ethical design.
Abstract
We introduce Cadenza, a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas as well as unconditional generations. To accomplish this we propose a novel MIDI encoding method, PerTok (Performance Tokenizer), that captures minute expressive details whilst reducing sequence length by up to 59% and vocabulary size by up to 95% for polyphonic, monophonic and rhythmic tasks. The proposed framework comprises two sequential stages: 1) Composer and 2) Performer. The Composer model is a transformer-based Variational Autoencoder (VAE) with Rotary Positional Embeddings (RoPE) and an autoregressive decoder modified to more effectively integrate the latent codes of the input musical idea. The Performer model is a bidirectional transformer encoder that is separately trained to predict velocities and microtimings on MIDI sequences. Objective and human…
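The abstract does not spell out how microtimings are represented, so the following is only a minimal sketch of the general idea of separating a note onset into a quantized grid position and a small expressive offset. It is not the authors' PerTok encoding or Performer model. The function names (split_onset, merge_onset) and the parameters (ticks_per_beat, steps_per_beat) are hypothetical and chosen purely for illustration.

```python
def split_onset(onset_ticks, ticks_per_beat=480, steps_per_beat=4):
    """Split an absolute onset (in MIDI ticks) into a grid step and a
    signed microtiming offset. Hypothetical helper, not part of PerTok."""
    step_ticks = ticks_per_beat // steps_per_beat  # grid resolution in ticks
    # Round to the nearest grid step (ties round up).
    grid_step = (onset_ticks + step_ticks // 2) // step_ticks
    # Signed deviation from the grid; negative means the note is early.
    offset = onset_ticks - grid_step * step_ticks
    return grid_step, offset


def merge_onset(grid_step, offset, ticks_per_beat=480, steps_per_beat=4):
    """Inverse of split_onset: rebuild the absolute onset in ticks."""
    step_ticks = ticks_per_beat // steps_per_beat
    return grid_step * step_ticks + offset


if __name__ == "__main__":
    # Onsets of a slightly "humanized" pattern, in ticks (illustrative values).
    onsets = [0, 125, 230, 365]
    for t in onsets:
        step, micro = split_onset(t)
        # The decomposition is lossless: merging recovers the original onset.
        assert merge_onset(step, micro) == t
        print(t, step, micro)
```

Under this assumed scheme, a model could predict the coarse grid step and the offset separately, which loosely mirrors the division of labor the abstract describes between composing a sequence and predicting its microtimings.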
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
Methods: Fast Attention Via Positive Orthogonal Random Features · Performer
