Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang

TL;DR
Art2Music generates feeling-aligned music from artistic images and user comments without explicit emotion labels. It combines a multi-stage pipeline with a new dataset, ArtiCaps, to improve perceptual naturalness and semantic consistency.
Contribution
The paper introduces ArtiCaps, a pseudo feeling-aligned image-music-text dataset, and proposes Art2Music, a lightweight multimodal method that generates feeling-aligned music from images and text.
Findings
Significant improvements in audio quality metrics.
Effective cross-modal feeling alignment verified by human ratings.
Robust performance with limited training data.
Abstract
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second…
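The first stage described in the abstract (gated residual fusion of image and text embeddings, then a frequency-weighted L1 loss on Mel-spectrograms) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the gate is a sigmoid over the concatenated embeddings, the residual path is the image embedding, and the loss weights rise linearly with Mel-bin index. The names `gated_residual_fusion`, `frequency_weighted_l1`, `W_gate`, `b_gate`, and `max_weight` are hypothetical and do not come from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_residual_fusion(img_emb, txt_emb, W_gate, b_gate):
    """Fuse two embeddings of size d (assumed form, not the paper's).

    A gate in (0, 1) is computed from the concatenated embeddings and
    scales the text embedding before it is added to the image embedding.
    W_gate has shape (2d, d) and b_gate has shape (d,).
    """
    concat = np.concatenate([img_emb, txt_emb], axis=-1)
    gate = sigmoid(concat @ W_gate + b_gate)
    return img_emb + gate * txt_emb


def frequency_weighted_l1(pred_mel, target_mel, max_weight=2.0):
    """L1 loss over (n_mels, n_frames) spectrograms.

    Assumed weighting: per-bin weights grow linearly from 1.0 at the
    lowest Mel bin to max_weight at the highest, so errors in high
    frequencies are penalized more.
    """
    n_mels = pred_mel.shape[0]
    weights = np.linspace(1.0, max_weight, n_mels)[:, None]
    return float(np.mean(weights * np.abs(pred_mel - target_mel)))


# Usage with random stand-ins for OpenCLIP embeddings and spectrograms.
rng = np.random.default_rng(0)
d = 8
img_emb = rng.normal(size=d)
txt_emb = rng.normal(size=d)
W_gate = rng.normal(scale=0.1, size=(2 * d, d))
b_gate = np.zeros(d)

fused = gated_residual_fusion(img_emb, txt_emb, W_gate, b_gate)
loss = frequency_weighted_l1(rng.normal(size=(80, 100)), rng.normal(size=(80, 100)))
```

In this sketch the gate decides, per dimension, how much of the text embedding is mixed into the image embedding, while the residual path keeps the image information intact. The bidirectional LSTM decoder and the second stage are omitted, since the abstract does not describe them in enough detail to sketch.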
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Music and Audio Processing
