Multimodal Metric Learning for Tag-based Music Retrieval
Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

TL;DR
This paper introduces a multimodal metric learning approach for tag-based music retrieval. It combines elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings to improve retrieval performance and allow flexible tag vocabularies.
Contribution
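It proposes a multimodal metric learning framework for tag-based music retrieval that avoids the fixed-vocabulary limitation of classification-based tagging by using pretrained word embeddings as side information, together with elaborate triplet sampling and domain-specific word embeddings.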
Findings
The proposed ideas improve retrieval performance in quantitative evaluation
The proposed ideas also improve retrieval results in qualitative evaluation
Introduction of the MSD500 dataset with tags and user profiles
Abstract
Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has already proven its suitability for cross-modal retrieval tasks in other domains (e.g., text-to-image) by jointly learning a multimodal embedding space. In this paper, we investigate three ideas to successfully introduce multimodal metric learning for tag-based music retrieval: elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings. Our experimental results show that the proposed ideas enhance the retrieval system both quantitatively and qualitatively. Furthermore, we…
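The metric learning idea in the abstract can be illustrated with a minimal sketch. It assumes that tags and songs have already been mapped into a shared embedding space, and it uses a cosine-similarity hinge loss over (tag, positive song, negative song) triplets. The function names, the toy vectors, and the margin value of 0.4 are illustrative assumptions, not details taken from this page, and the sketch omits the encoders and the triplet sampling strategy entirely.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor_tag, positive_song, negative_song, margin=0.4):
    """Hinge loss on one triplet (hypothetical helper, margin is an assumption).

    The loss is zero once the tag is closer to the positive song than to
    the negative song by at least `margin` in cosine similarity.
    """
    pos = cosine_similarity(anchor_tag, positive_song)
    neg = cosine_similarity(anchor_tag, negative_song)
    return max(0.0, margin - pos + neg)


def rank_songs(tag_embedding, song_embeddings):
    """Return song ids ordered from most to least similar to the tag."""
    return sorted(
        song_embeddings,
        key=lambda song_id: cosine_similarity(tag_embedding, song_embeddings[song_id]),
        reverse=True,
    )


# Toy 2-dimensional embeddings, invented for illustration only.
tag = [1.0, 0.0]
songs = {"song_a": [0.9, 0.1], "song_b": [0.0, 1.0]}

well_separated = triplet_loss(tag, songs["song_a"], songs["song_b"])
violating = triplet_loss(tag, songs["song_b"], songs["song_a"])
ranking = rank_songs(tag, songs)
```

Because retrieval only needs a vector for the query tag, any tag that has a word embedding can be used at query time, which is how this kind of setup sidesteps a fixed vocabulary.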
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
