Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment

Jiaying Hong; Ting Zhu; Thanet Markchom; Huizhi Liang

arXiv:2512.00120 · cs.SD · December 2, 2025

Open Access

TL;DR

Art2Music is a framework that generates feeling-aligned music from artistic images and user comments without explicit emotion labels. It combines a multi-stage generation pipeline with a new dataset, ArtiCaps, and improves perceptual naturalness and semantic consistency.

Contribution

The paper introduces ArtiCaps, a pseudo feeling-aligned image-music-text dataset, and proposes Art2Music, a lightweight multimodal music generation method that produces feeling-aligned audio from images and text.

Findings

01. Significant improvements in audio quality metrics.

02. Effective cross-modal feeling alignment verified by human ratings.

03. Robust performance with limited training data.

Abstract

With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second…
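The first stage described in the abstract (gated fusion of image and text embeddings, then a frequency-weighted L1 loss on Mel-spectrograms) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the gate form, the placement of the residual connection, the linear weight ramp, and every name here (`gated_residual_fusion`, `frequency_weighted_l1`, `W_gate`, `b_gate`, `max_weight`) are assumptions made for illustration, and the OpenCLIP encoders and bidirectional LSTM decoder are omitted.

```python
import numpy as np


def gated_residual_fusion(img_emb, txt_emb, W_gate, b_gate):
    """Fuse two embeddings with a sigmoid gate plus a residual path.

    Assumed form (not taken from the paper): the gate is computed from the
    concatenated embeddings and mixes them elementwise; the image embedding
    is then added back as the residual.
    """
    concat = np.concatenate([img_emb, txt_emb], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(concat @ W_gate + b_gate)))  # values in (0, 1)
    mixed = gate * img_emb + (1.0 - gate) * txt_emb
    return img_emb + mixed


def frequency_weighted_l1(pred_mel, target_mel, max_weight=2.0):
    """L1 loss over (n_mels, n_frames) arrays with larger weights on higher bins.

    Assumed weighting: a linear ramp from 1.0 at the lowest Mel bin to
    max_weight at the highest, so high-frequency errors cost more.
    """
    n_mels = pred_mel.shape[0]
    weights = np.linspace(1.0, max_weight, n_mels)[:, None]  # broadcast over frames
    return float(np.mean(weights * np.abs(pred_mel - target_mel)))


# Usage with random stand-ins for the encoder outputs and spectrograms.
rng = np.random.default_rng(0)
dim, n_mels, n_frames = 8, 16, 32
img = rng.normal(size=dim)
txt = rng.normal(size=dim)
W = rng.normal(size=(2 * dim, dim)) * 0.1
b = np.zeros(dim)

fused = gated_residual_fusion(img, txt, W, b)
loss = frequency_weighted_l1(
    rng.normal(size=(n_mels, n_frames)), rng.normal(size=(n_mels, n_frames))
)
print(fused.shape, loss)
```

With `max_weight=1.0` the loss reduces to a plain mean absolute error, which makes the effect of the weighting easy to isolate when comparing runs.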

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Music and Audio Processing