VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Yan-Bo Lin; Yu Tian; Linjie Yang; Gedas Bertasius; Heng Wang

arXiv:2409.07450 · cs.MM · September 12, 2024

Open Access

TL;DR

This paper introduces VMAS, a novel framework that generates background music from videos by leveraging large-scale web data, semantic alignment, and a new video encoder, resulting in more realistic and diverse music.

Contribution

The paper presents a new video-music Transformer with semantic and beat alignment, and a large dataset for training, advancing the quality and diversity of video-to-music generation.

Findings

01 Outperforms existing methods on DISCO-MV and MusicCaps datasets.

02 Uses a large-scale dataset of 2.2 million video-music samples.

03 Achieves higher human evaluation scores for music realism and relevance.

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder…
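
As a rough illustration of the joint autoregressive and contrastive objective mentioned in the abstract, the sketch below adds a next-token cross-entropy term to a symmetric InfoNCE-style contrastive term over paired video and music embeddings. This is not the authors' implementation: the abstract does not give the exact loss, so the InfoNCE form, the `temperature` and `weight` values, and all function names here are assumptions chosen for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits: list of T rows, each with V scores over music tokens.
    targets: list of T ground-truth token indices.
    """
    total = 0.0
    for row, t in zip(logits, targets):
        total -= log_softmax(row)[t]
    return total / len(targets)


def normalize(vec):
    """L2-normalize a vector (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of video_emb is assumed to be paired with row i of music_emb;
    every other row in the batch serves as a negative.
    """
    v = [normalize(x) for x in video_emb]
    m = [normalize(x) for x in music_emb]
    sim = [[sum(a * b for a, b in zip(vi, mj)) / temperature for mj in m]
           for vi in v]
    n = len(v)
    video_to_music = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    music_to_video = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(logits, targets, video_emb, music_emb, weight=1.0):
    """Autoregressive term plus a weighted contrastive term (weight is hypothetical)."""
    return (autoregressive_loss(logits, targets)
            + weight * contrastive_loss(video_emb, music_emb))
```

In this sketch the contrastive term is small when each video embedding is most similar to its own music embedding and large when the pairs are mismatched, which is one way to encourage generated music to track high-level video content.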

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies

Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer · Adam