Cross-Modal Music-Video Recommendation: A Study of Design Choices

Laure Pretet; Gael Richard; Geoffroy Peeters

arXiv:2104.14799·cs.MM·May 3, 2021

TL;DR

This paper investigates cross-modal music-video recommendation using self-supervised learning. On the Music Video Dataset, it shows that learned audio embeddings improve recommendations over handcrafted audio features and that a cross-modal triplet loss outperforms binary cross-entropy.

Contribution

The paper builds on the VM-NET video-music retrieval system, replacing its handcrafted audio features with pre-trained audio embeddings (MuSimNet, OpenL3, MusicCNN, AudioSet), and validates the cross-modal triplet loss against the binary cross-entropy loss.

Findings

01. Pre-trained audio embeddings improve recommendation performance.

02. Triplet loss outperforms binary cross-entropy in this setting.

03. Using learned audio representations enhances cross-modal retrieval accuracy.
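
To make the retrieval setting behind these findings concrete, here is a minimal sketch of ranking music tracks for a video by cosine similarity in a shared embedding space, scored with Recall@K. The helper names (`cosine`, `rank_tracks`, `recall_at_k`) and the toy embeddings are hypothetical, and Recall@K is assumed here only as a common retrieval metric; this page does not state which metric the paper reports.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_tracks(video_emb, audio_embs):
    """Indices of audio tracks, sorted from most to least similar to the video."""
    scores = [cosine(video_emb, a) for a in audio_embs]
    return sorted(range(len(audio_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(video_embs, audio_embs, k):
    """Fraction of videos whose paired track (same index) appears in the top k."""
    hits = sum(
        1 for i, v in enumerate(video_embs)
        if i in rank_tracks(v, audio_embs)[:k]
    )
    return hits / len(video_embs)

# Toy embeddings, made up for illustration: video i is paired with track i.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tracks = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.5]]

print(rank_tracks(videos[0], tracks))
print(recall_at_k(videos, tracks, k=1))
```

In a real system the embeddings would come from the trained audio and video branches; the ranking and scoring logic stays the same.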

Abstract

In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn audio and video embeddings by using their co-occurrence in music-video clips. In this work, we build upon a recent video-music retrieval system (the VM-NET), which originally relies on an audio representation obtained by a set of statistics computed over handcrafted features. We demonstrate here that using audio representation learning such as the audio embeddings provided by the pre-trained MuSimNet, OpenL3, MusicCNN or by AudioSet, largely improves recommendations. We also validate the use of the cross-modal triplet loss originally proposed in…
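
The two training objectives contrasted in this summary can be illustrated with a generic sketch. It assumes dot-product similarity between a video embedding and an audio embedding, a hinge-style triplet loss with a hypothetical `margin` parameter, and a binary cross-entropy loss on the sigmoid of the similarity. The exact formulation used in the VM-NET and in the paper may differ (for example, with additional terms), so the function names and values below are illustrative only.

```python
import math

def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge triplet loss on similarities: max(0, margin - s(a, p) + s(a, n))."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

def bce_loss(anchor, other, label):
    """Binary cross-entropy on sigmoid(similarity); label 1 means a co-occurring pair."""
    p = 1.0 / (1.0 + math.exp(-dot(anchor, other)))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Toy embeddings, made up for illustration.
video = [1.0, 0.0]           # anchor: a video embedding
true_track = [0.8, 0.6]      # positive: the track that co-occurs with the video
hard_negative = [0.6, 0.8]   # a mismatched track that is still fairly similar
easy_negative = [-1.0, 0.0]  # a mismatched track that is very dissimilar

print(triplet_loss(video, true_track, hard_negative))  # positive loss: margin violated
print(triplet_loss(video, true_track, easy_negative))  # zero loss: margin satisfied
print(bce_loss(video, true_track, 1), bce_loss(video, hard_negative, 0))
```

The triplet loss only constrains the relative ordering of the matched and mismatched pair, and contributes nothing once the margin is satisfied, whereas binary cross-entropy scores each pair independently as matched or mismatched.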
