Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio

TL;DR
This paper introduces ArtToMus, a framework for direct artwork-to-music generation, together with ArtSound, a large-scale dataset of 105,884 artwork-music pairs. The framework synthesizes music guided solely by visual features, without relying on text or language-based supervision.
Contribution
It presents the first direct artwork-to-music generation framework that maps visual embeddings into a diffusion model without image-to-text translation, supported by a new large-scale dataset.
Findings
Generates musically coherent and stylistically consistent outputs
Achieves competitive perceptual quality despite lower alignment scores
Establishes direct visual-to-music generation as a new research direction
Abstract
Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Multimodal Machine Learning Applications
