Contrastive Learning for Cross-modal Artist Retrieval
Andres Ferraro, Jaehun Kim, Sergio Oramas, Andreas Ehmann, Fabien Gouyon

TL;DR
This paper introduces a contrastive learning approach to combine multiple modality embeddings for artist retrieval, improving accuracy and robustness, especially for less popular artists with missing modality data.
Contribution
It presents a novel contrastive learning method that effectively integrates diverse modality embeddings and handles missing data in music artist retrieval tasks.
Findings
Outperforms single-modality embeddings and baseline algorithms for combining modalities, in both artist retrieval accuracy and coverage
Significantly benefits retrieval of less popular artists
More robust to missing modality data
Abstract
Music retrieval and recommendation applications often rely on content features encoded as embeddings, which provide vector representations of items in a music dataset. Numerous complementary embeddings can be derived from processing items originally represented in several modalities, e.g., audio signals, user interaction data, or editorial data. However, data of any given modality might not be available for all items in any music dataset. In this work, we propose a method based on contrastive learning to combine embeddings from multiple modalities and explore the impact of the presence or absence of embeddings from diverse modalities in an artist similarity task. Experiments on two datasets suggest that our contrastive method outperforms single-modality embeddings and baseline algorithms for combining modalities, both in terms of artist retrieval accuracy and coverage. Improvements with…
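The abstract does not spell out the loss or the architecture, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a symmetric InfoNCE-style objective between two modality embeddings of the same artists, and it handles missing data by letting only artists with both modalities present contribute to the loss. The function names, the `present_a`/`present_b` masks, and the temperature value are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def masked_info_nce(za, zb, present_a, present_b, temperature=0.1):
    """Symmetric InfoNCE between two modality embeddings (illustrative assumption).

    za, zb:     (n_artists, dim) embeddings from two modalities, already
                projected to a shared dimension.
    present_*:  (n_artists,) boolean masks; False marks a missing modality.
    """
    both = present_a & present_b          # keep artists that have both modalities
    if both.sum() < 2:                    # a contrastive loss needs negatives
        return 0.0
    a = l2_normalize(za[both])
    b = l2_normalize(zb[both])
    logits = a @ b.T / temperature        # pairwise similarities; matches on the diagonal

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                      # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a retrieval directions.
    return float(0.5 * (diagonal_cross_entropy(logits)
                        + diagonal_cross_entropy(logits.T)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))                          # hypothetical audio embeddings
    interactions = audio + 0.01 * rng.normal(size=(8, 16))    # hypothetical, nearly aligned
    has_audio = np.ones(8, dtype=bool)
    has_interactions = np.ones(8, dtype=bool)
    has_interactions[0] = False                               # artist 0 lacks interaction data
    print(masked_info_nce(audio, interactions, has_audio, has_interactions))
```

Dropping incomplete pairs is only one possible way to cope with missing modalities; it is chosen here because it keeps the sketch short, and it should not be read as the strategy the paper evaluates.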
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Contrastive Learning
