TL;DR
This paper introduces BMdataset, a curated LilyPond music dataset, and LilyBERT, a specialized encoder, demonstrating that small, expert datasets can outperform large noisy corpora in music classification tasks.
Contribution
A curated LilyPond dataset (BMdataset) and a domain-specific encoder (LilyBERT), together showing that small, high-quality data can improve music understanding.
Findings
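Fine-tuning on BMdataset alone (~90M tokens) outperforms continuous pre-training on the full PDMX corpus (~15B tokens), as measured by linear probing on the out-of-domain Mutopia corpus.
Combining pre-training and fine-tuning yields the highest accuracy (84.3%).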
Small curated datasets can be more effective than large noisy ones for music classification.
Abstract
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens)…
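The vocabulary-extension step mentioned in the abstract can be illustrated with a toy example. The sketch below is not the LilyBERT tokenizer: the base vocabulary, the three added tokens, and the `tokenize` helper are all hypothetical stand-ins, and the paper's actual list of 115 tokens is not reproduced here. It only shows why adding whole LilyPond commands to a subword vocabulary shortens the token sequence for a score.

```python
# Toy illustration of vocabulary extension. Everything here is an assumption
# made for the example; none of it is taken from the LilyBERT implementation.

def tokenize(word, vocab):
    """Greedy longest-match segmentation of one whitespace-delimited word.

    Any character that matches no multi-character vocabulary entry is
    emitted as a single-character token.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokenize_text(text, vocab):
    """Tokenize a whole snippet by splitting on whitespace first."""
    return [tok for word in text.split() for tok in tokenize(word, vocab)]


# Hypothetical generic subword vocabulary (stands in for a code tokenizer).
base_vocab = {"\\", "rel", "ative", "cl", "ef", "time", "treble"}

# Hypothetical LilyPond-specific additions (the real model adds 115 tokens).
lilypond_tokens = ["\\relative", "\\clef", "\\time"]
extended_vocab = base_vocab | set(lilypond_tokens)

snippet = "\\relative c' { \\clef treble \\time 3/4 c4 d e }"

base_tokens = tokenize_text(snippet, base_vocab)
ext_tokens = tokenize_text(snippet, extended_vocab)

print(len(base_tokens), base_tokens)
print(len(ext_tokens), ext_tokens)
```

With the extended vocabulary, each LilyPond command becomes a single token instead of several fragments. In Hugging Face Transformers, the analogous step is typically `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether LilyBERT was built with exactly these calls is an assumption, since the abstract does not say.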
