MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Sanjoy Chowdhury; Sayan Nag; K J Joseph; Balaji Vasan Srinivasan; Dinesh Manocha

arXiv:2406.04673·cs.CV·June 10, 2024

TL;DR

MeLFusion is a diffusion-based model that synthesizes music from combined textual and visual cues, introducing a novel visual synapse mechanism and a new dataset and evaluation metric for this task.

Contribution

It presents a new multimodal music synthesis model with a visual synapse, along with a dataset and evaluation metric, advancing the integration of visual information in music generation.

Findings

01

Adding visual cues to the text-conditioned synthesis pipeline significantly improves the quality of the generated music.

02

The model achieves a relative improvement of up to 67.98% on the FAD (Fréchet Audio Distance) score, a metric where lower values are better.

03

The approach demonstrates the effectiveness of multimodal conditioning in music synthesis.
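
To make the FAD finding concrete, the sketch below shows how a Fréchet-style distance and a relative gain on a lower-is-better score could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: `frechet_distance_diag` and `relative_gain` are hypothetical helper names, the distance assumes diagonal covariances (standard FAD uses full covariance matrices and a matrix square root over audio-embedding statistics), and the numbers in the usage lines are made up rather than taken from the paper.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real, generated):
    """Squared Frechet distance between two embedding sets.

    Assumption for illustration: each set is modelled as a Gaussian with a
    DIAGONAL covariance, so the trace term reduces to a sum of squared
    differences between per-dimension standard deviations.
    """
    total = 0.0
    for d in range(len(real[0])):
        r = [row[d] for row in real]
        g = [row[d] for row in generated]
        mean_term = (fmean(r) - fmean(g)) ** 2
        std_term = (math.sqrt(pvariance(r)) - math.sqrt(pvariance(g))) ** 2
        total += mean_term + std_term
    return total


def relative_gain(baseline, improved):
    """Percentage reduction of a lower-is-better score relative to a baseline."""
    return 100.0 * (baseline - improved) / baseline


# Toy embeddings (made-up values, two samples of dimension two).
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
distance = frechet_distance_diag(real, generated)

# Made-up scores: a baseline of 4.0 lowered to 1.0 is a 75% relative gain.
gain = relative_gain(4.0, 1.0)
```

Under this reading, a relative gain of 67.98% means the FAD score fell to roughly a third of the baseline value it is compared against.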

Abstract

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to…

Code & Models

Repositories

schowdhury671/melfusion
jax · Official

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion