Emotion-Guided Image to Music Generation

Souraja Kundu; Saket Singh; Yuji Iwahori

arXiv:2410.22299 · cs.SD · October 30, 2024

TL;DR

This paper introduces an emotion-guided image-to-music generation framework that aligns generated music with the emotional content of an image, using a Valence-Arousal (VA) loss and a CNN-Transformer architecture. The authors report that it outperforms previous models.

Contribution

The paper presents a VA-loss-based approach, built on a CNN-Transformer model, for generating music that matches the emotion of an input image, and validates it on a newly curated dataset.

Findings

1. Superior performance on emotional and musical metrics
2. Effective emotional alignment with images
3. Robust convergence and high-quality MIDI output

Abstract

Generating music from images can enhance various applications, including background music for photo slideshows, social media experiences, and video creation. This paper presents an emotion-guided image-to-music generation framework that leverages the Valence-Arousal (VA) emotional space to produce music that aligns with the emotional tone of a given image. Unlike previous models that rely on contrastive learning for emotional consistency, the proposed approach directly integrates a VA loss function to enable accurate emotional alignment. The model employs a CNN-Transformer architecture, featuring pre-trained CNN image feature extractors and three Transformer encoders to capture complex, high-level emotional features from MIDI music. Three Transformer decoders refine these features to generate musically and emotionally consistent MIDI sequences. Experimental results on a newly curated…
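The abstract does not state the exact form of the VA loss. As a purely illustrative sketch, one simple way to penalize emotional mismatch is a mean squared error between the (valence, arousal) pair estimated for the generated music and the pair estimated for the image. The function name `va_loss`, the MSE form, and the example values are assumptions made here for illustration; they are not taken from the paper.

```python
from typing import Sequence, Tuple

VA = Tuple[float, float]  # (valence, arousal); the value range is an assumption


def va_loss(pred_va: Sequence[VA], target_va: Sequence[VA]) -> float:
    """Hypothetical VA loss: mean squared error over valence and arousal.

    pred_va   -- VA pairs estimated for the generated music (assumed input)
    target_va -- VA pairs estimated for the source images (assumed input)
    """
    if len(pred_va) != len(target_va) or len(pred_va) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (pv, pa), (tv, ta) in zip(pred_va, target_va):
        # Squared error on each emotional axis, summed per example.
        total += (pv - tv) ** 2 + (pa - ta) ** 2
    # Average over all examples and both axes.
    return total / (2 * len(pred_va))


# Example: music predicted as calm-positive vs. an image rated excited-positive.
loss = va_loss([(0.6, -0.2)], [(0.6, 0.8)])
```

In a setup like the one the abstract describes, a term of this kind would typically be added to the usual sequence-generation loss, so that the model is penalized both for implausible MIDI tokens and for emotional mismatch with the image.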

Taxonomy

Topics: Aesthetic Perception and Analysis · Color perception and design · Digital Media and Visual Art

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax