Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation

Jing Yi; Yaochen Zhu; Jiayi Xie; Zhenzhong Chen

arXiv:2107.07268 · cs.MM · December 13, 2022

TL;DR

This paper introduces CMVAE, a hierarchical Bayesian model that aligns micro-videos with suitable background music by projecting multimodal data into a shared space, improving recommendation accuracy.

Contribution

The paper presents a novel cross-modal variational auto-encoder with a PoE-based fusion mechanism and a large-scale dataset for micro-video music recommendation.

Findings

01

CMVAE outperforms existing methods on the TT-150k dataset.

02

PoE fusion improves robustness by weighting the visual and textual modalities according to their estimated variance, so that the less noisy modality contributes more.

03

Qualitative analysis shows meaningful and accurate recommendations.

Abstract

In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of the two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimations, such that the modality with a lower noise level is given more weight. As a result, the micro-video latent variables contain less irrelevant information, which leads to more robust model generalization. Furthermore, we establish a…
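
To make the variance-based weighting in the abstract concrete, the following is a minimal sketch of product-of-experts fusion for diagonal Gaussian experts, where each modality contributes in proportion to its precision (inverse variance). It is an illustration under stated assumptions, not the paper's implementation: the function name `poe_fuse`, the two-dimensional toy inputs, and the omission of any prior expert are all choices made here, and the excerpt does not say how CMVAE parameterizes its experts.

```python
import numpy as np

def poe_fuse(means, variances):
    """Fuse diagonal-Gaussian experts by precision weighting.

    Hypothetical helper for illustration; rows index experts
    (modalities), columns index latent dimensions.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    precisions = 1.0 / variances                 # low variance -> high weight
    fused_var = 1.0 / precisions.sum(axis=0)     # precisions add under a product of Gaussians
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Toy inputs (assumed values): the visual expert is confident in dimension 0,
# the textual expert is confident in dimension 1.
visual_mean, visual_var = np.array([1.0, 0.0]), np.array([0.1, 1.0])
text_mean, text_var = np.array([0.0, 2.0]), np.array([1.0, 0.1])

fused_mean, fused_var = poe_fuse([visual_mean, text_mean],
                                 [visual_var, text_var])
print(fused_mean, fused_var)
```

In this toy case the fused mean sits close to the visual estimate in the first dimension and close to the textual estimate in the second, which is the behavior the abstract describes: the modality with the lower noise level dominates.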
