Masked Contrastive Pre-Training Improves Music Audio Key Detection
Ori Yonay, Tracy Hammond, Tianbao Yang

TL;DR
This paper demonstrates that masked contrastive self-supervised pretraining significantly enhances music key detection accuracy by producing pitch-sensitive representations, achieving state-of-the-art results without complex data augmentation.
Contribution
The paper presents the first systematic study of how self-supervised pretraining design affects pitch sensitivity, and it introduces masked contrastive embeddings for improved music key detection.
Findings
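Linear evaluation after masked contrastive pretraining on Mel spectrograms yields competitive key detection performance out of the box.
Shallow but wide MLPs trained on features extracted from the base model achieve SOTA results without sophisticated data augmentation policies.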
Learned representations encode common data augmentations naturally.
Abstract
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms yields competitive performance on music key detection out of the box. This motivates training shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, which reaches SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common…
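To make the idea of masking-based contrastive pretraining on Mel spectrograms more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the masking scheme, the encoder, or the loss, so the frame-level masking in `random_mask`, the InfoNCE objective in `info_nce`, the linear stand-in `encode`, the 0.5 mask ratio, and all tensor shapes are assumptions chosen purely for illustration.

```python
import numpy as np

def random_mask(spec, mask_ratio, rng):
    """Zero out a random subset of time frames of a (n_mels, n_frames) array.

    Frame-level masking is an assumed scheme, not taken from the paper.
    """
    n_frames = spec.shape[1]
    n_masked = int(round(mask_ratio * n_frames))
    masked_cols = rng.choice(n_frames, size=n_masked, replace=False)
    out = spec.copy()
    out[:, masked_cols] = 0.0
    return out

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b
    against every other row in the batch (assumed objective)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch = rng.random((8, 64, 100))            # 8 hypothetical Mel spectrograms
W = rng.standard_normal((64 * 100, 32)) * 0.01  # stand-in for a real encoder

def encode(specs):
    """Hypothetical encoder: flatten each spectrogram and project linearly."""
    return specs.reshape(len(specs), -1) @ W

# Two independently masked views of the same clips form the positive pairs.
view_a = np.stack([random_mask(s, 0.5, rng) for s in batch])
view_b = np.stack([random_mask(s, 0.5, rng) for s in batch])
loss = info_nce(encode(view_a), encode(view_b))
```

In a real pipeline the linear projection would be replaced by a trainable network and the loss minimized by gradient descent; the frozen features would then feed the linear probe or shallow, wide MLP described in the abstract.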
