Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

Jaeyong Kang; Soujanya Poria; Dorien Herremans

arXiv:2311.00968 · cs.SD · June 3, 2024 · 2 citations

Open Access · 1 Repo

TL;DR

This paper introduces Video2Music, a framework that generates music matching the emotion of a given video, using a novel Affective Multimodal Transformer trained on a new multimodal dataset of music videos.

Contribution

The work presents a new multimodal dataset, MuVi-Sync, and a novel Affective Multimodal Transformer (AMT) model that enforces affective similarity between video and music for video-guided music generation.

Findings

01 Generated music matches video emotion effectively.

02 User study confirms high quality of generated music and video matching.

03 Proposed model outperforms baseline methods in emotion alignment.

Abstract

Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music.…
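The abstract mentions extracting music features such as note density and loudness. As a rough illustration only, the sketch below computes both per one-second window. The window length, the function names `note_density` and `rms_loudness`, and the use of RMS amplitude as a loudness proxy are all assumptions made here; the abstract does not define how the authors compute these features.

```python
import math
from collections import Counter


def note_density(onsets_sec, duration_sec):
    """Count note onsets in each one-second window (window size is an assumption)."""
    n_windows = math.ceil(duration_sec)
    # Bucket each onset by the integer second in which it starts.
    counts = Counter(int(t) for t in onsets_sec if 0 <= t < duration_sec)
    return [counts.get(i, 0) for i in range(n_windows)]


def rms_loudness(samples, sample_rate):
    """RMS amplitude per one-second window, used here as a simple loudness proxy."""
    loudness = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        loudness.append(math.sqrt(sum(x * x for x in window) / len(window)))
    return loudness


# Hypothetical inputs: four note onsets over three seconds, and a toy signal
# sampled at 2 Hz so that each window holds two samples.
density = note_density([0.1, 0.5, 1.2, 2.9], 3.0)
loudness = rms_loudness([1.0, -1.0, 0.5, 0.5], 2)
```

Per-window sequences like these could be aligned with per-second video features, which is one plausible reason to pick a fixed window; the actual alignment scheme is not stated in the abstract.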

Code & Models

Repositories

amaai-lab/video2music (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Dropout · Residual Connection · Byte Pair Encoding · Dense Connections · Layer Normalization · Multi-Head Attention · Label Smoothing · Position-Wise Feed-Forward Layer