VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang

TL;DR
This paper introduces VMAS, a novel framework that generates background music from videos by leveraging large-scale web data, semantic alignment, and a new video encoder, resulting in more realistic and diverse music.
Contribution
The paper presents a new video-music Transformer with semantic and beat alignment, and a large dataset for training, advancing the quality and diversity of video-to-music generation.
Findings
Outperforms existing methods on DISCO-MV and MusicCaps datasets.
Uses a large-scale dataset of 2.2 million video-music samples.
Achieves higher human evaluation scores than existing methods for music realism and relevance.
Abstract
We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder…
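The joint autoregressive and contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the symmetric InfoNCE form of the contrastive term, the temperature of 0.07, and the `weight` parameter are all assumptions chosen for illustration. The sketch only shows how a next-token loss over music tokens might be combined with a loss that pulls paired video and music embeddings together.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def autoregressive_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence of music tokens."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_tokens)]
    return sum(nll) / len(nll)

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE (an assumed form): the i-th video and the i-th
    music clip are the positive pair; all other pairings are negatives."""
    v = [l2_normalize(e) for e in video_embs]
    m = [l2_normalize(e) for e in music_embs]
    n = len(v)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(v[i], m[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-music direction: softmax over each row.
    v2m = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Music-to-video direction: softmax over each column.
    m2v = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (v2m + m2v)

def joint_loss(token_logits, target_tokens, video_embs, music_embs, weight=1.0):
    """Autoregressive term plus a weighted contrastive term.
    `weight` is a hypothetical balancing hyperparameter."""
    ar = autoregressive_loss(token_logits, target_tokens)
    con = contrastive_loss(video_embs, music_embs)
    return ar + weight * con
```

In this sketch, the contrastive term is small when each video embedding is most similar to its own music embedding and large when the pairs are mismatched, which is the sense in which such a term could encourage music aligned with high-level video content.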
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies
Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer · Adam
