Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
Jaeyong Kang, Soujanya Poria, Dorien Herremans

TL;DR
This paper introduces Video2Music, a framework that generates emotionally matching music for videos using a novel Affective Multimodal Transformer and a new dataset, advancing video-music alignment technology.
Contribution
The work presents a new multimodal dataset, MuVi-Sync, and a novel Affective Multimodal Transformer (AMT) model that enforces affective similarity between video and music in video-guided music generation.
Findings
Generated music matches video emotion effectively.
A user study confirms the high quality of the generated music and how well it matches the video.
Proposed model outperforms baseline methods in emotion alignment.
Abstract
Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music.…
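To make the music-side features mentioned in the abstract more concrete, the sketch below computes a per-second note density from a list of note onset times. It is only an illustration: the function name `note_density`, the one-second window, and the "count onsets per window" definition are assumptions made here, not the authors' documented procedure.

```python
import math
from collections import Counter


def note_density(onsets_sec, duration_sec):
    """Count note onsets in each one-second window.

    Assumed definition for illustration only; the paper may define
    note density differently.
    """
    n_windows = math.ceil(duration_sec)
    # Bucket each onset by the integer second it falls into.
    counts = Counter(int(t) for t in onsets_sec if 0 <= t < duration_sec)
    return [counts.get(i, 0) for i in range(n_windows)]


if __name__ == "__main__":
    # Hypothetical onsets (in seconds) from a transcribed MIDI clip.
    print(note_density([0.1, 0.5, 1.2, 3.9], 4.0))
```

A per-window sequence like this could be aligned with per-second video features (for example, motion or emotion scores), which is the kind of pairing the abstract describes.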
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Linear Layer · Dropout · Residual Connection · Byte Pair Encoding · Dense Connections · Layer Normalization · Multi-Head Attention · Label Smoothing · Position-Wise Feed-Forward Layer
