Cross-Modal Music-Video Recommendation: A Study of Design Choices
Laure Pretet, Gael Richard, Geoffroy Peeters

TL;DR
This paper investigates cross-modal music-video recommendation using self-supervised learning, demonstrating that learned audio embeddings and specific loss functions significantly enhance recommendation accuracy on the Music Video Dataset.
Contribution
The paper builds on the VM-NET video-music retrieval system, replacing its handcrafted audio features with pre-trained audio embeddings, and validates a cross-modal triplet loss against binary cross-entropy.
Findings
Pre-trained audio embeddings improve recommendation performance.
Triplet loss outperforms binary cross-entropy in this setting.
Using learned audio representations enhances cross-modal retrieval accuracy.
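To make the triplet-loss finding concrete, the following is a minimal NumPy sketch of a cross-modal triplet loss with in-batch negatives. It is an illustration under stated assumptions, not the paper's exact formulation: the function names, the cosine similarity, the margin value of 0.2, and the use of every other clip in the batch as a negative are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def cross_modal_triplet_loss(audio_emb, video_emb, margin=0.2):
    """Hinge triplet loss averaged over all in-batch negatives, both directions.

    Row i of audio_emb and row i of video_emb are assumed to come from the
    same music-video clip (the positive pair). Every other row in the batch
    serves as a negative. Requires a batch of at least two clips.
    The margin of 0.2 is an illustrative assumption.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    v = l2_normalize(np.asarray(video_emb, dtype=float))
    sim = a @ v.T                      # sim[i, j]: audio i vs. video j
    pos = np.diag(sim)                 # similarity of each matching pair
    off_diag = ~np.eye(sim.shape[0], dtype=bool)

    # Audio anchor i, video negative j: margin + sim[i, j] - sim[i, i]
    a2v = np.maximum(0.0, margin + sim - pos[:, None])
    # Video anchor j, audio negative i: margin + sim[i, j] - sim[j, j]
    v2a = np.maximum(0.0, margin + sim - pos[None, :])

    return float((a2v[off_diag].mean() + v2a[off_diag].mean()) / 2.0)


# Toy batch: three clips whose audio and video embeddings align perfectly.
aligned_audio = np.eye(3)
aligned_video = np.eye(3)
loss_aligned = cross_modal_triplet_loss(aligned_audio, aligned_video)

# Same batch with the video rows shifted, so every pairing is wrong.
shuffled_video = np.roll(np.eye(3), 1, axis=0)
loss_shuffled = cross_modal_triplet_loss(aligned_audio, shuffled_video)
```

Unlike a binary cross-entropy objective, which scores each audio-video pair independently as match or non-match, this loss only asks that a matching pair score higher than mismatched pairs by at least the margin. In the toy batch, the aligned embeddings incur no loss, while the shifted ones are penalized.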
Abstract
In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn audio and video embeddings by using their co-occurrence in music-video clips. In this work, we build upon a recent video-music retrieval system (the VM-NET), which originally relies on an audio representation obtained by a set of statistics computed over handcrafted features. We demonstrate here that using audio representation learning such as the audio embeddings provided by the pre-trained MuSimNet, OpenL3, MusicCNN or by AudioSet, largely improves recommendations. We also validate the use of the cross-modal triplet loss originally proposed in…
