Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong

arXiv:2506.12573·cs.SD·July 1, 2025

TL;DR

This paper introduces a new dataset and method for generating film music by leveraging video content, improving the alignment of music with visual and emotional cues in movies.

Contribution

The paper presents the Open Screen Soundtrack Library dataset and a video-adapted transformer model for text-to-music generation tailored to film production.

Findings

01. Enhanced music generation quality with video conditioning

02. Improved objective and subjective metrics for film music

03. Public release of dataset and tools for future research

Abstract

Despite recent advancements in music generation systems, their application in film production remains limited, as they struggle to capture the nuances of real-world filmmaking, where filmmakers consider multiple factors (such as visual content, dialogue, and emotional tone) when selecting or composing music for a scene. This limitation primarily stems from the absence of comprehensive datasets that integrate these elements. To address this gap, we introduce Open Screen Soundtrack Library (OSSL), a dataset consisting of movie clips from public domain films, totaling approximately 36.5 hours, paired with high-quality soundtracks and human-annotated mood information. To demonstrate the effectiveness of our dataset in improving the performance of pre-trained models on film music generation tasks, we introduce a new video adapter that enhances an autoregressive transformer-based text-to-music…

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Human Motion and Animation