Supervised contrastive learning from weakly-labeled audio segments for musical version matching
Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

TL;DR
This paper introduces a supervised contrastive learning approach that learns from weakly labeled audio segments to match musical versions at the segment level. It outperforms existing methods and may apply beyond the audio domain.
Contribution
It proposes a new weakly supervised learning method with a contrastive loss variant that improves segment-level musical version matching accuracy.
Findings
Achieved state-of-the-art track-level performance.
Significantly improved segment-level matching results.
Demonstrated the generality of the approach for other domains.
Abstract
Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we do not only achieve state-of-the-art results in the…
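The idea of reducing pairwise segment distances to a single track-level score can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine distance, the `min` and `mean` reductions, and the function names `pairwise_segment_distances` and `track_distance` are all assumptions chosen for illustration, since the excerpt above does not specify which distance or reduction the paper uses.

```python
import numpy as np


def pairwise_segment_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine distance between every segment of one track and every segment of another.

    a: (n_a, d) segment embeddings of the first track (hypothetical input).
    b: (n_b, d) segment embeddings of the second track (hypothetical input).
    Returns an (n_a, n_b) matrix of distances in [0, 2].
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def track_distance(a: np.ndarray, b: np.ndarray, reduction: str = "min") -> float:
    """Reduce the segment-to-segment distance matrix to one track-level number.

    The choice of reduction is an assumption made for this sketch:
    "min" keeps the best-matching segment pair, "mean" averages all pairs.
    """
    d = pairwise_segment_distances(a, b)
    if reduction == "min":
        return float(d.min())
    if reduction == "mean":
        return float(d.mean())
    raise ValueError(f"unknown reduction: {reduction!r}")


# Toy embeddings: two segments per track, two dimensions each.
track_a = np.array([[1.0, 0.0], [0.0, 1.0]])
track_b = np.array([[0.0, 1.0], [1.0, 1.0]])

min_dist = track_distance(track_a, track_b, "min")
mean_dist = track_distance(track_a, track_b, "mean")
```

With a `min` reduction, two tracks count as close when at least one pair of segments matches well. This is one plausible way to turn track-level labels into a segment-level training signal, because only the track pairing is known, not which segments correspond.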
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
