Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong

TL;DR
This paper introduces a new dataset and method for generating film music by leveraging video content, improving the alignment of music with visual and emotional cues in movies.
Contribution
The paper presents the Open Screen Soundtrack Library dataset and a video-adapted transformer model for text-to-music generation tailored to film production.
Findings
Enhanced music generation quality with video conditioning
Improved objective and subjective metrics for film music
Public release of dataset and tools for future research
Abstract
Despite recent advancements in music generation systems, their application in film production remains limited, as they struggle to capture the nuances of real-world filmmaking, where filmmakers consider multiple factors-such as visual content, dialogue, and emotional tone-when selecting or composing music for a scene. This limitation primarily stems from the absence of comprehensive datasets that integrate these elements. To address this gap, we introduce Open Screen Soundtrack Library (OSSL), a dataset consisting of movie clips from public domain films, totaling approximately 36.5 hours, paired with high-quality soundtracks and human-annotated mood information. To demonstrate the effectiveness of our dataset in improving the performance of pre-trained models on film music generation tasks, we introduce a new video adapter that enhances an autoregressive transformer-based text-to-music…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Analysis and Summarization · Music and Audio Processing · Human Motion and Animation
