Emotion-Guided Image to Music Generation
Souraja Kundu, Saket Singh, Yuji Iwahori

TL;DR
This paper introduces an emotion-guided image-to-music generation framework that aligns music with the emotional content of images using a Valence-Arousal (VA) loss and a CNN-Transformer architecture, outperforming previous models.
Contribution
It presents a novel VA loss-based approach with a CNN-Transformer model for emotion-aligned music generation from images, validated on a new dataset.
Findings
Superior performance on emotional and musical metrics
Effective emotional alignment with images
Robust convergence and high-quality MIDI output
Abstract
Generating music from images can enhance various applications, including background music for photo slideshows, social media experiences, and video creation. This paper presents an emotion-guided image-to-music generation framework that leverages the Valence-Arousal (VA) emotional space to produce music that aligns with the emotional tone of a given image. Unlike previous models that rely on contrastive learning for emotional consistency, the proposed approach directly integrates a VA loss function to enable accurate emotional alignment. The model employs a CNN-Transformer architecture, featuring pre-trained CNN image feature extractors and three Transformer encoders to capture complex, high-level emotional features from MIDI music. Three Transformer decoders refine these features to generate musically and emotionally consistent MIDI sequences. Experimental results on a newly curated…
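The abstract does not give the form of the VA loss, so the following is only a minimal sketch of how a VA term could be combined with a token-level generation loss. It assumes the VA loss is a mean squared error between a (valence, arousal) pair estimated for the generated music and the pair associated with the input image, added to a cross-entropy loss over MIDI tokens. The function names, the `va_weight` parameter, and the MSE form are illustrative assumptions, not the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def va_loss(pred_va, target_va):
    """ASSUMED form: mean squared error over the (valence, arousal) pair."""
    return sum((p - t) ** 2 for p, t in zip(pred_va, target_va)) / len(target_va)

def total_loss(step_logits, step_targets, music_va, image_va, va_weight=1.0):
    """Mean token cross-entropy plus a weighted VA term.

    `va_weight` is a hypothetical hyperparameter; the abstract does not say
    how the two terms are balanced.
    """
    token_term = sum(
        cross_entropy(logits, target)
        for logits, target in zip(step_logits, step_targets)
    ) / len(step_targets)
    return token_term + va_weight * va_loss(music_va, image_va)

# Toy example: two decoding steps over a 3-token vocabulary.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]
image_va = (0.8, 0.6)  # made-up values for a positive, energetic image

aligned = total_loss(logits, targets, (0.8, 0.6), image_va)
misaligned = total_loss(logits, targets, (-0.8, -0.6), image_va)
assert misaligned > aligned  # the VA term penalizes emotional mismatch
```

Under these assumptions, two outputs with identical token likelihoods are ranked by how closely their estimated emotion matches the image, which is the kind of direct alignment signal the abstract contrasts with contrastive learning.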
Taxonomy
Topics: Aesthetic Perception and Analysis · Color perception and design · Digital Media and Visual Art
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax
