TL;DR
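This paper introduces MOVE, a musically-motivated embedding method for accurate and scalable version identification. It combines an alternative input representation, temporal content summarization, a standardized latent space, a triplet loss with hard triplet mining, and task-specific data augmentation, and reports state-of-the-art results on two publicly available benchmark sets.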
Contribution
MOVE combines musically-motivated embeddings with scalable triplet-loss training for version identification.
Findings
Achieves state-of-the-art performance on two publicly available benchmark sets
Demonstrates the effectiveness of temporal content summarization
Shows the impact of embedding dimensionality on performance
Abstract
The version identification (VI) task deals with the automatic detection of recordings that correspond to the same underlying musical piece. Despite many efforts, VI is still an open problem, with much room for improvement, especially with regard to combining accuracy and scalability. In this paper, we present MOVE, a musically-motivated method for accurate and scalable version identification. MOVE achieves state-of-the-art performance on two publicly available benchmark sets by learning scalable embeddings in a Euclidean distance space, using a triplet loss and a hard triplet mining strategy. It improves over previous work by employing an alternative input representation, and introducing a novel technique for temporal content summarization, a standardized latent space, and a data augmentation strategy specifically designed for VI. In addition to the main results, we perform an ablation…
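To make the training objective in the abstract more concrete, here is a minimal sketch of a triplet loss with hard triplet mining over Euclidean embeddings. It assumes batch-hard mining (for each anchor, the farthest same-piece embedding and the closest different-piece embedding), squared Euclidean distances, and a margin parameter. These choices, and the function names `pairwise_sq_euclidean` and `batch_hard_triplet_loss`, are illustrative assumptions and are not taken from the paper, whose exact mining strategy and hyperparameters may differ.

```python
import numpy as np


def pairwise_sq_euclidean(emb):
    """Squared Euclidean distance between every pair of rows in emb."""
    sq_norms = np.sum(emb ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * emb @ emb.T
    # Clip small negative values caused by floating-point round-off.
    return np.maximum(dist, 0.0)


def batch_hard_triplet_loss(emb, labels, margin=1.0):
    """Batch-hard triplet loss (an assumed mining variant, not the paper's spec).

    emb:    (n, d) array of embeddings, one per recording.
    labels: length-n sequence; equal labels mean "versions of the same piece".
    margin: hypothetical margin value; tune it for real data.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    dist = pairwise_sq_euclidean(emb)

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same piece, not self
    neg_mask = ~same                                    # different piece

    # Hardest positive: the farthest embedding of the same piece.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: the closest embedding of a different piece.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Only anchors with at least one positive and one negative form a triplet.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return float(losses[valid].mean()) if valid.any() else 0.0


# Toy batch: two pieces with two versions each, in a 2-D embedding space.
toy_emb = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [3.0, 1.0]])
toy_labels = [0, 0, 1, 1]
toy_loss = batch_hard_triplet_loss(toy_emb, toy_labels, margin=5.0)
```

In this toy batch the two pieces are already well separated, so the loss is zero for a small margin and becomes positive only when the margin is large. At retrieval time, versions would be ranked by the same distance between embeddings, which is what makes the approach scale to large collections.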
Taxonomy
Methods: Triplet Loss
