Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems
Luis Carvalho, Tobias Washüttl, Gerhard Widmer

TL;DR
This paper explores self-supervised contrastive learning to improve cross-modal music retrieval by pre-training on real music data, significantly enhancing snippet retrieval and piece identification accuracy in audio-sheet music systems.
Contribution
It demonstrates that self-supervised contrastive pre-training on real music data improves cross-modal retrieval performance and generalizes better to real-world scenarios.
Findings
Pre-trained models show better retrieval precision across scenarios.
In cross-modal piece identification, retrieval quality improves from 30% up to 100% when real music data is present.
Self-supervised learning alleviates data scarcity in music retrieval.
Abstract
Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models are able to retrieve snippets…
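As an illustration of what "contrasting randomly augmented views" can mean in practice, the sketch below implements the NT-Xent loss popularized by SimCLR in plain NumPy. The excerpt above does not state which loss the authors use, so the loss choice, the function name `nt_xent_loss`, and the `temperature` default are assumptions made for illustration only. Rows `z_a[i]` and `z_b[i]` stand for embeddings of two augmented views of the same snippet; all other rows in the batch act as negatives.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss (assumed formulation, not confirmed by the source).

    z_a, z_b: arrays of shape (n, d); z_a[i] and z_b[i] are embeddings
    of two augmented views of the same snippet.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)            # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit length -> cosine similarity
    sim = (z @ z.T) / temperature                     # (2n, 2n) scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # a view is never its own candidate

    # Row i's positive is the other view of the same snippet.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable row-wise log-softmax.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    # Near-identical views should score a lower loss than unrelated ones.
    print(nt_xent_loss(z, z + 0.01 * rng.normal(size=z.shape)))
    print(nt_xent_loss(z, rng.normal(size=z.shape)))
```

In a pre-training setup of this kind, the embeddings would come from an encoder applied to augmented audio or sheet-image snippets, and the loss would be minimized with a gradient-based optimizer; both of those parts are omitted here.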
Taxonomy
Methods: Contrastive Learning
