Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Serkan Sulun, Paula Viana, Matthew E. P. Davies

TL;DR
This paper presents EMSYNC, an automated system that generates symbolic (MIDI) music aligned with a video's emotional content and temporal boundaries, using a two-stage framework with a boundary offset conditioning mechanism and an emotion mapping scheme.
Contribution
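The paper introduces boundary offsets, a new temporal conditioning mechanism, and an emotion mapping scheme that connects the categorical outputs of a video emotion classifier to the continuous valence-arousal inputs of the music generator. The resulting system is reported to outperform existing models.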
Findings
Outperforms state-of-the-art models in objective evaluations.
Achieves better emotional and temporal alignment in generated music.
Demonstrates effectiveness across multiple video datasets.
Abstract
Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of…
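The two conditioning ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the emotion categories, their valence-arousal coordinates, the probability-weighted averaging, and the definition of a boundary offset as "seconds until the next scene cut" are all assumptions made for illustration, and the function names `categories_to_va` and `boundary_offsets` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical valence-arousal coordinates in [-1, 1]. Illustrative values
# only; they are not taken from the paper.
EMOTION_TO_VA = {
    "joy": (0.8, 0.6),
    "anger": (-0.6, 0.8),
    "sadness": (-0.7, -0.5),
    "calm": (0.5, -0.6),
}


def categories_to_va(probs):
    """Map categorical emotion probabilities to one (valence, arousal) point.

    Assumed scheme: probability-weighted average of per-category coordinates.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    valence = sum(p * EMOTION_TO_VA[name][0] for name, p in probs.items()) / total
    arousal = sum(p * EMOTION_TO_VA[name][1] for name, p in probs.items()) / total
    return valence, arousal


def boundary_offsets(event_times, cut_times):
    """For each event time, return the seconds remaining until the next scene cut.

    Assumed definition: a cut exactly at the event time gives an offset of 0.0;
    events after the last cut get None.
    """
    cuts = sorted(cut_times)
    offsets = []
    for t in event_times:
        i = bisect_left(cuts, t)  # index of the first cut at or after t
        offsets.append(cuts[i] - t if i < len(cuts) else None)
    return offsets


if __name__ == "__main__":
    print(categories_to_va({"joy": 0.5, "sadness": 0.5}))
    print(boundary_offsets([0.0, 1.0, 2.5, 4.0, 5.0], [2.0, 4.0]))
```

In this sketch, a generator conditioned on the offset sequence would see the value shrink toward zero as a cut approaches, which is one way a model could anticipate a boundary and place a chord on it.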
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN
