SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data
Liqian Zhang, Magdalena Fuentes

TL;DR
SONIQUE is a novel model that generates customizable background music for videos using unpaired audio-visual data, large language models, and diffusion techniques, enabling flexible and user-controlled music creation.
Contribution
SONIQUE introduces an approach that trains on unpaired data and uses LLMs for video understanding to generate tailored music, unlike traditional methods that depend on paired audio-visual datasets.
Findings
Enables user control over music attributes like instruments and genre
Uses unpaired data to train a video-to-music generation model
Open-source implementation with a demo available
Abstract
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.
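The tag-based control described in the abstract can be illustrated with a minimal sketch. It assumes that tags inferred from the video and tags chosen by the user are merged into a single text condition, with user choices taking priority. The names `MusicTags`, `merge_tags`, and `to_prompt` are hypothetical and are not taken from the SONIQUE codebase; the LLM video-understanding step and the diffusion model itself are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MusicTags:
    """Hypothetical container for musical tags (not the SONIQUE API)."""
    instruments: List[str] = field(default_factory=list)
    genres: List[str] = field(default_factory=list)
    moods: List[str] = field(default_factory=list)
    tempo: Optional[str] = None


def merge_tags(video_tags: MusicTags, user_tags: MusicTags) -> MusicTags:
    """Assumed policy: a field set by the user replaces the video-derived
    value; fields the user leaves empty fall back to the video tags."""
    return MusicTags(
        instruments=user_tags.instruments or video_tags.instruments,
        genres=user_tags.genres or video_tags.genres,
        moods=user_tags.moods or video_tags.moods,
        tempo=user_tags.tempo or video_tags.tempo,
    )


def to_prompt(tags: MusicTags) -> str:
    """Flatten tags into a comma-separated string that a text-conditioned
    music generator could take as its condition."""
    parts = tags.genres + tags.moods + tags.instruments
    if tags.tempo:
        parts.append(tags.tempo)
    return ", ".join(parts)


# Tags an LLM might derive from a description of a quiet landscape video.
video = MusicTags(instruments=["piano"], genres=["ambient"],
                  moods=["calm"], tempo="slow")
# The user only overrides the instrument.
user = MusicTags(instruments=["acoustic guitar"])

merged = merge_tags(video, user)
print(to_prompt(merged))
```

Under this assumed policy, the user keeps the genre, mood, and tempo suggested by the video while swapping in a preferred instrument.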
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
