Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation
Jing Yi, Yaochen Zhu, Jiayi Xie, Zhenzhong Chen

TL;DR
This paper introduces CMVAE, a hierarchical Bayesian model that aligns micro-videos with suitable background music by projecting multimodal data into a shared space, improving recommendation accuracy.
Contribution
The paper presents a novel cross-modal variational auto-encoder with a PoE-based fusion mechanism and a large-scale dataset for micro-video music recommendation.
Findings
CMVAE outperforms existing methods on the TT-150k dataset.
The PoE fusion improves robustness by giving more weight to the modality with the lower estimated noise level.
Qualitative analysis shows meaningful and accurate recommendations.
Abstract
In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of the two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimations, such that the modality with a lower noise level is given more weight. Therefore, the micro-video latent variables contain less irrelevant information, which results in more robust model generalization. Furthermore, we establish a…
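The variance-based weighting described in the abstract can be illustrated with a minimal sketch of product-of-experts fusion for diagonal Gaussians. This is not the authors' implementation: the function name `poe_fuse`, the two-dimensional latent, and the example numbers are all hypothetical, and the sketch assumes each modality encoder outputs a mean and a log-variance. Under that assumption, the product of Gaussian experts is again Gaussian, with a precision equal to the sum of the experts' precisions and a mean equal to the precision-weighted average of their means.

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Fuse diagonal-Gaussian experts by multiplying their densities.

    mus, logvars: sequences of arrays with identical shapes, one per expert.
    Returns the mean and variance of the resulting Gaussian.
    """
    mus = np.asarray(mus, dtype=float)
    # Precision is the inverse variance: exp(-logvar) = 1 / sigma^2.
    precisions = np.exp(-np.asarray(logvars, dtype=float))
    fused_var = 1.0 / precisions.sum(axis=0)
    # Each expert's mean is weighted by its precision.
    fused_mu = fused_var * (precisions * mus).sum(axis=0)
    return fused_mu, fused_var

# Hypothetical example: a confident visual expert and a noisy textual expert.
mu_visual, logvar_visual = np.array([1.0, 0.0]), np.log(np.array([0.1, 0.1]))
mu_text, logvar_text = np.array([-1.0, 2.0]), np.log(np.array([10.0, 10.0]))

fused_mu, fused_var = poe_fuse(
    [mu_visual, mu_text], [logvar_visual, logvar_text]
)
print(fused_mu, fused_var)
```

In this toy setting the fused mean stays close to the visual expert's mean, because its variance is a hundred times smaller than the textual expert's. The fused variance is also smaller than either input variance, since precisions add.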
