Video-based Music Generation

Serkan Sulun

arXiv:2602.07063·cs.LG·February 10, 2026

Open Access

TL;DR

This paper introduces EMSYNC, an automatic system that generates emotionally and rhythmically synchronized music for videos using a novel emotion classifier, a large-scale emotion-labeled MIDI dataset, and a new temporal boundary conditioning method.

Contribution

The paper presents EMSYNC, a framework that combines video emotion classification, emotion-conditioned music generation, and temporal synchronization for automatic video-based music generation.

Findings

01. The video emotion classifier achieves state-of-the-art results on the Ekman-6 and MovieNet datasets.

02. In user studies, EMSYNC is preferred over existing methods on multiple aspects.

03. The paper introduces a large-scale, emotion-labeled MIDI dataset for affective music generation.

Abstract

As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only fusion layers, we reduce computational complexity while improving accuracy. We show the generalization abilities of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for…
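
The abstract's idea of keeping pretrained feature extractors frozen and training only fusion layers can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration: the two input streams (named `visual` and `audio` here), the random-projection stand-ins for pretrained networks, the concatenation-plus-linear-softmax fusion, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random

random.seed(0)

def make_frozen_extractor(in_dim, out_dim):
    """Hypothetical stand-in for a pretrained network: a fixed random projection.

    The weights are created once and never updated, which mimics a frozen backbone.
    """
    weights = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def extract(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return extract

def fuse(extractors, inputs):
    """Concatenate the features produced by each frozen extractor."""
    feats = []
    for extract, x in zip(extractors, inputs):
        feats.extend(extract(x))
    return feats

class FusionClassifier:
    """The only trainable part: a linear softmax layer over the fused features."""

    def __init__(self, feat_dim, n_classes):
        self.W = [[0.0] * feat_dim for _ in range(n_classes)]
        self.b = [0.0] * n_classes

    def predict_proba(self, feats):
        logits = [sum(w * f for w, f in zip(row, feats)) + b
                  for row, b in zip(self.W, self.b)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sgd_step(self, feats, label, lr):
        """One cross-entropy gradient step; extractor weights are untouched."""
        probs = self.predict_proba(feats)
        for c in range(len(self.W)):
            grad = probs[c] - (1.0 if c == label else 0.0)
            for j, f in enumerate(feats):
                self.W[c][j] -= lr * grad * f
            self.b[c] -= lr * grad

# Two hypothetical input streams with 4 and 3 raw dimensions, 8 features each.
extractors = [make_frozen_extractor(4, 8), make_frozen_extractor(3, 8)]

# Toy data: the label depends on both streams, so fusion is needed.
dataset = []
for _ in range(40):
    visual = [random.uniform(-1, 1) for _ in range(4)]
    audio = [random.uniform(-1, 1) for _ in range(3)]
    label = 1 if visual[0] + audio[0] > 0 else 0
    dataset.append(((visual, audio), label))

def mean_loss(model):
    """Average cross-entropy of the model over the toy dataset."""
    total = 0.0
    for x, y in dataset:
        total += -math.log(model.predict_proba(fuse(extractors, x))[y])
    return total / len(dataset)

model = FusionClassifier(feat_dim=16, n_classes=2)
loss_before = mean_loss(model)
for _ in range(20):  # a few epochs of plain SGD on the fusion layer only
    for x, y in dataset:
        model.sgd_step(fuse(extractors, x), y, lr=0.05)
loss_after = mean_loss(model)
print(f"mean loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Because gradients are computed only for the fusion layer, each training step costs far less than updating full backbones, which is the kind of computational saving the abstract refers to. In a deep learning framework the same effect would typically be obtained by disabling gradient tracking on the pretrained modules.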

Taxonomy

Topics: Music and Audio Processing · Emotion and Mood Recognition · Music Technology and Sound Studies