Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi; Matteo Mendula; Nicola Fanelli; Florence Levé; Matteo Testi; Giovanna Castellano; Gennaro Vessio

arXiv:2602.17599 · cs.CV · February 20, 2026

TL;DR

This paper introduces ArtToMus, a framework for direct artwork-to-music generation supported by a large-scale artwork-music dataset. It synthesizes music guided solely by visual features, without text or any other language-based supervision.

Contribution

It presents the first direct artwork-to-music generation framework that maps visual embeddings into a diffusion model without image-to-text translation, supported by a new large-scale dataset.
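To make the idea of feeding visual embeddings to a generator without a text stage more concrete, here is a minimal sketch of one possible building block: a linear adapter that resizes an image embedding to the width a conditioning input expects. This is an illustration under stated assumptions, not the authors' architecture. The function name `project_visual_embedding`, the toy dimensions, and the weights are all hypothetical, and the summary above does not say how ArtToMus performs its mapping.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project_visual_embedding(embedding: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Linearly map an image embedding to a conditioning vector.

    Hypothetical adapter for illustration only; the paper's actual
    mapping is not described in this summary.
    """
    if len(weights) != len(bias):
        raise ValueError("weights and bias must have the same number of rows")
    projected = []
    for row, b in zip(weights, bias):
        if len(row) != len(embedding):
            raise ValueError("each weight row must match the embedding length")
        # One output coordinate: dot product of the row with the embedding, plus bias.
        projected.append(sum(w * x for w, x in zip(row, embedding)) + b)
    return projected


# Toy example (assumed values): a 3-dimensional visual embedding
# projected to a 2-dimensional conditioning vector.
visual_embedding = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]
conditioning = project_visual_embedding(visual_embedding, weights, bias)
```

In a real system the weights would be learned and the output would replace the text-encoder embedding that a text-conditioned generator normally consumes; no caption is produced at any point in this sketch.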

Findings

01. Generates musically coherent and stylistically consistent outputs

02. Achieves competitive perceptual quality despite lower alignment scores

03. Establishes direct visual-to-music generation as a new research direction

Abstract

Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Multimodal Machine Learning Applications