BMdataset: A Musicologically Curated LilyPond Dataset

Matteo Spanio; Ilay Guler; Antonio Rodà

arXiv:2604.10628·cs.SD·April 14, 2026

TL;DR

This paper introduces BMdataset, a musicologically curated dataset of LilyPond scores, and LilyBERT, a CodeBERT-based encoder adapted to LilyPond. It shows that a small, expert-curated dataset can outperform a much larger, noisy corpus on music classification tasks.

Contribution

A musicologically curated LilyPond dataset and a domain-specific encoder, together with evidence that small, high-quality data improves music understanding.

Findings

01. Fine-tuning on BMdataset alone outperforms continuous pre-training on the much larger PDMX corpus.

02. Combining pre-training and fine-tuning yields the highest accuracy (84.3%).

03. Small curated datasets can be more effective than large noisy ones for music classification.
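The abstract reports these comparisons as linear-probing results: the encoder is kept frozen and only a linear classifier is trained on its embeddings. A minimal sketch of that evaluation idea follows, in pure Python. The two-dimensional feature vectors, the binary labels, and the function names are hypothetical stand-ins, not LilyBERT embeddings or the authors' evaluation code.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on fixed features with plain SGD.

    The features are treated as constants: only the probe's weights and
    bias are updated, which is what makes this a *linear probe*.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            g = p - y                        # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical frozen embeddings for four scores and two made-up classes.
features = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.7]]
labels = [0, 0, 1, 1]

w, b = train_linear_probe(features, labels)
predictions = [predict(w, b, x) for x in features]
accuracy = sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Because the probe has so little capacity, its accuracy mostly reflects how much task-relevant structure the frozen embeddings already contain, which is why it is a common way to compare pre-training and fine-tuning recipes.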

Abstract

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens)…
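To illustrate why vocabulary extension matters for a text format like LilyPond, here is a small sketch using a toy greedy longest-match tokenizer. Everything in it is an assumption made for illustration: the base vocabulary, the five added tokens (the paper adds 115, which are not listed here), and the `tokenize` helper are hypothetical and do not reproduce the CodeBERT tokenizer or the authors' pipeline.

```python
# Hypothetical base vocabulary: a few generic subwords, standing in for a
# code tokenizer that has never seen LilyPond commands.
BASE_VOCAB = {"rel", "ative", "cle", "f", "time", "key", "major", "treble"}

# Hypothetical domain tokens (illustrative only; not the paper's list of 115).
LILYPOND_TOKENS = ["\\relative", "\\clef", "\\time", "\\key", "\\major"]

def tokenize(text, vocab):
    """Greedy longest-match segmentation of each whitespace-separated word.

    Characters that match no vocabulary entry become single-character tokens.
    """
    max_len = max(len(t) for t in vocab)
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first, shrinking until one matches.
            for j in range(min(len(word), i + max_len), i, -1):
                if word[i:j] in vocab:
                    break
            else:
                j = i + 1  # no match: fall back to a single character
            tokens.append(word[i:j])
            i = j
    return tokens

snippet = "\\relative c' { \\clef treble \\key d \\major \\time 3/4 }"

base_tokens = tokenize(snippet, BASE_VOCAB)
extended_tokens = tokenize(snippet, BASE_VOCAB | set(LILYPOND_TOKENS))

print(len(base_tokens), base_tokens)
print(len(extended_tokens), extended_tokens)
```

With the extended vocabulary, each LilyPond command is kept as a single token instead of being split into a backslash plus subword fragments, so the sequence is shorter and each command can receive its own embedding.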

Code & Models

Models

csc-unipd/lilybert (Hugging Face model; 63 downloads, 1 like)
