TL;DR
This paper introduces a novel cross-modal contrastive learning method that combines multiple music-related data sources to improve music representation quality, outperforming baseline models across several tasks.
Contribution
It presents a new approach that integrates heterogeneous music data sources using contrastive learning to enhance music representations.
Findings
Outperforms the baseline audio CNN on all tasks
Achieves performance comparable to state-of-the-art methods
Highlights the importance of training with multiple data sources
Abstract
Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our…
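The alignment step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style loss, assumes the audio embedding is paired separately with the genre and playlist embeddings, and uses hypothetical names (`info_nce`, `cross_modal_loss`, `temperature`). The embeddings are taken as given, where row i of each matrix belongs to the same track.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(a, b, temperature=0.1):
    # Symmetric InfoNCE-style loss between two batches of embeddings.
    # Row i of `a` and row i of `b` form the positive pair; every other
    # row in the batch acts as a negative. (Assumed loss form.)
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))

def cross_modal_loss(audio, genre, playlist, temperature=0.1):
    # Assumption: audio is the anchor modality and is aligned with each
    # of the other two modalities; the two terms are summed unweighted.
    return (info_nce(audio, genre, temperature)
            + info_nce(audio, playlist, temperature))
```

With embeddings for a batch of tracks, `cross_modal_loss(audio, genre, playlist)` returns a scalar that is small when matching rows agree across modalities and grows when they do not. In a real training setup the same computation would be written in an autodiff framework so the encoders can be updated by gradient descent.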
Taxonomy
Methods: Contrastive Learning
