
TL;DR
This paper introduces EMSYNC, an automatic system that generates emotionally and rhythmically synchronized music for videos using a novel emotion classifier, a large-scale emotion-labeled MIDI dataset, and a new temporal boundary conditioning method.
Contribution
It presents EMSYNC, a framework that combines video emotion classification, emotion-conditioned music generation, and temporal synchronization into a single pipeline for automatically scoring videos.
Findings
The video emotion classifier achieved state-of-the-art results on the Ekman-6 and MovieNet datasets.
User studies favor EMSYNC over existing methods in multiple aspects.
Introduced a large-scale emotion-labeled MIDI dataset for affective music generation.
Abstract
As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only fusion layers, we reduce computational complexity while improving accuracy. We show the generalization abilities of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for…
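The abstract's efficiency idea is to keep pretrained feature extractors frozen and train only the fusion layers. It can be illustrated with a toy sketch in plain Python. This is not EMSYNC's implementation. The two `FrozenExtractor` objects are hypothetical stand-ins (fixed random projections) for pretrained networks. Fusion is assumed to be concatenation followed by a single linear softmax layer trained with SGD, because this summary does not describe the paper's actual extractors or fusion layers. The six output classes match the number of emotion categories in Ekman-6.

```python
import math
import random


class FrozenExtractor:
    """Hypothetical stand-in for a pretrained network: a fixed random
    projection whose weights are never updated during training."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, x):
        # Linear projection followed by tanh; no trainable state.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]


class FusionHead:
    """Assumed fusion layer: a trainable linear softmax classifier
    over the concatenated features of all frozen extractors."""

    def __init__(self, in_dim, num_classes):
        self.w = [[0.0] * in_dim for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def probs(self, feats):
        logits = [sum(wi * f for wi, f in zip(row, feats)) + b for row, b in zip(self.w, self.b)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def step(self, feats, label, lr):
        # One SGD step on cross-entropy loss; only the head's parameters change.
        p = self.probs(feats)
        for k in range(len(self.w)):
            grad = p[k] - (1.0 if k == label else 0.0)
            self.b[k] -= lr * grad
            row = self.w[k]
            for j, f in enumerate(feats):
                row[j] -= lr * grad * f

    def predict(self, feats):
        p = self.probs(feats)
        return max(range(len(p)), key=p.__getitem__)


def fuse(extractors, video):
    # Concatenate the outputs of every frozen extractor into one feature vector.
    feats = []
    for extractor in extractors:
        feats.extend(extractor(video))
    return feats


if __name__ == "__main__":
    rng = random.Random(0)
    NUM_CLASSES = 6  # Ekman-6 has six emotion categories
    extractors = [FrozenExtractor(8, 4, seed=1), FrozenExtractor(8, 4, seed=2)]
    head = FusionHead(in_dim=8, num_classes=NUM_CLASSES)
    # Toy data: noisy copies of one random prototype "video" vector per class.
    prototypes = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(NUM_CLASSES)]
    data = [([v + rng.gauss(0, 0.1) for v in prototypes[c]], c)
            for c in range(NUM_CLASSES) for _ in range(10)]
    # Frozen extractors let features be computed once and cached.
    cached = [(fuse(extractors, x), y) for x, y in data]
    for _ in range(200):
        for feats, y in cached:
            head.step(feats, y, lr=0.1)
    acc = sum(head.predict(f) == y for f, y in cached) / len(cached)
    print(f"training accuracy: {acc:.2f}")
```

Because the extractors never change, their outputs can be computed once and reused. Each training epoch then updates only the small fusion head. This is one plausible source of the reduced computational cost the abstract mentions.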
Taxonomy
Topics: Music and Audio Processing · Emotion and Mood Recognition · Music Technology and Sound Studies
