MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao

TL;DR
MuVi is a novel framework for generating music that aligns with video content by analyzing visual semantics, ensuring synchronization, and allowing style control, thereby improving audio-visual cohesion and immersion.
Contribution
Introduces MuVi, a comprehensive video-to-music generation system with semantic analysis, rhythmic synchronization, and style control, advancing the quality and coherence of audio-visual content.
Findings
MuVi achieves superior audio quality and synchronization.
The contrastive pre-training scheme enhances music-visual alignment.
The flow-matching-based music generator supports in-context style and genre control.
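To make the contrastive pre-training idea in the findings above more concrete, here is a minimal sketch of a generic symmetric InfoNCE objective over paired video and music embeddings. It is not the paper's actual scheme, which this page does not specify: the function names, the temperature of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    peak = max(scores)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[target] - log_sum


def symmetric_info_nce(video_emb, music_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not from the paper).

    Row i of `video_emb` and row i of `music_emb` are treated as a positive
    pair; every other pairing in the batch is treated as a negative.
    """
    v = [normalize(x) for x in video_emb]
    m = [normalize(x) for x in music_emb]
    n = len(v)
    logits = [
        [sum(a * b for a, b in zip(v[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-music direction: each row should peak on its own column.
    v2m = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Music-to-video direction: each column should peak on its own row.
    m2v = -sum(
        log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2m + m2v)


# Toy data (an assumption): aligned video embeddings are noisy copies of music ones.
rng = random.Random(0)
music = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned_video = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in music]
shuffled_video = aligned_video[::-1]  # reversing 8 rows mismatches every pair

loss_aligned = symmetric_info_nce(aligned_video, music)
loss_shuffled = symmetric_info_nce(shuffled_video, music)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each video embedding is paired with its own music embedding than when the pairs are shuffled, which is the property a contrastive alignment objective is meant to reward.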
Abstract
Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives. This paper presents MuVi, a novel framework that effectively addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme to ensure synchronization, based on the periodic nature of music phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing
