Supervised contrastive learning from weakly-labeled audio segments for musical version matching

Joan Serrà; R. Oguz Araz; Dmitry Bogdanov; Yuki Mitsufuji

arXiv:2502.16936·cs.SD·May 19, 2025

TL;DR

This paper introduces a supervised contrastive learning approach for matching musical versions at the segment level; it outperforms existing methods and may apply beyond the audio domain.

Contribution

The paper proposes a method for learning from weakly annotated segments, together with a contrastive loss variant; combined, they improve the accuracy of segment-level musical version matching.

Findings

01

Achieved state-of-the-art track-level performance.

02

Significantly improved segment-level matching results.

03

Demonstrated the generality of the approach for other domains.

Abstract

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20 s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the…
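To make the phrase "pairwise segment distance reductions" more concrete, here is a minimal sketch, not the authors' implementation. It assumes each track is a list of segment embeddings, that segments are compared with cosine distance, and that the segment-by-segment distance matrix is collapsed into one track-level value by a reduction. The function names and the two reductions shown (`min` and `mean_of_row_min`) are illustrative assumptions; the abstract does not say which reductions the paper uses.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_segment_distances(track_a, track_b):
    """Distance between every segment of track_a and every segment of track_b."""
    return [[cosine_distance(u, v) for v in track_b] for u in track_a]

def reduce_to_track_distance(dist, how="min"):
    """Collapse a segment-by-segment distance matrix into one track-level value.

    Both reductions are hypothetical examples chosen for illustration:
    - "min": distance of the single best-matching segment pair.
    - "mean_of_row_min": average, over segments of the first track, of the
      distance to their closest segment in the second track.
    """
    if how == "min":
        return min(min(row) for row in dist)
    if how == "mean_of_row_min":
        return sum(min(row) for row in dist) / len(dist)
    raise ValueError(f"unknown reduction: {how}")

# Toy 2-D segment embeddings for two tracks (made-up values).
track_a = [[1.0, 0.0], [0.0, 1.0]]
track_b = [[0.0, 1.0], [1.0, 1.0]]

dist = pairwise_segment_distances(track_a, track_b)
best_pair = reduce_to_track_distance(dist, how="min")
avg_best = reduce_to_track_distance(dist, how="mean_of_row_min")
```

A reduction of this kind lets a track-level label (two tracks are, or are not, versions of each other) supervise segment embeddings without stating which specific segments correspond.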

Videos

Supervised Contrastive Learning from Weakly-Labeled Audio Segments for Musical Version Matching · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis