Emergent musical properties of a transformer under contrastive self-supervised learning

Yuexuan Kong; Gabriel Meseguer-Brocal; Vincent Lostanlen; Mathieu Lagrange; Romain Hennequin

arXiv:2506.23873 · cs.SD · July 1, 2025

Open Access

TL;DR

This paper demonstrates that contrastive self-supervised learning with transformers can effectively capture both global and local musical properties, challenging the belief that more complex SSL methods are needed for local MIR tasks.

Contribution

It reveals the potential of simple contrastive SSL paired with a transformer to learn meaningful musical features for local MIR tasks, without specialized training.

Findings

01. Sequence tokens perform well on local tasks despite simple training.

02. Layer-wise attention maps reveal emergence of musical features like onsets.

03. Different layers capture distinct musical dimensions.

Abstract

In music information retrieval (MIR), contrastive self-supervised learning (SSL) for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL, such as masked modeling, is necessary. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time-frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in…
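For readers unfamiliar with NT-Xent, the following is a minimal NumPy sketch of the loss in its commonly used form, not the authors' implementation. It assumes each row of `z_a` and `z_b` is the class-token embedding of one of two augmented views of the same audio excerpt; the function name, the argument names, and the default temperature of 0.1 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent over N positive pairs (z_a[i], z_b[i]).

    z_a, z_b: arrays of shape (N, D), assumed here to hold class-token
    embeddings of two views of the same N excerpts (an assumption made
    for illustration). The temperature default is likewise hypothetical.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot product = cosine similarity
    logits = (z @ z.T) / temperature                   # (2N, 2N) temperature-scaled similarities
    np.fill_diagonal(logits, -np.inf)                  # a sample is never its own candidate

    # Row i (first view) has its positive at column i + N, and vice versa.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the positive pair as the target class.
    return float(-log_prob[np.arange(2 * n), positives].mean())
```

Note that only one embedding per excerpt enters the loss. In a setup like the one the abstract describes, the per-patch sequence tokens would receive no direct training signal from this objective.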

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception