Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions
Junchang Shi, Gang Li

TL;DR
This paper introduces MESA MIG, a multi-agent framework that generates semantically and emotionally aligned images from music by producing structured captions and refining images with specialized agents, outperforming baselines in quality and emotional consistency.
Contribution
The paper presents a novel multi-agent system that jointly models semantic and emotional alignment between music and images, incorporating caption generation, refinement, and affective state prediction.
Findings
Outperforms caption-only and single-agent baselines in aesthetic quality and semantic consistency.
Achieves emotion regression performance for music and images that is competitive with the state of the art.
Demonstrates effective alignment of music-derived emotions with generated images.
Abstract
When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA MIG, a multi-agent, semantic- and emotion-aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence-Arousal (VA) regression head predicts continuous affective states from music, while a CLIP-based visual VA head estimates emotions from images. These components jointly enforce semantic and emotional alignment between music and synthesized images. Experiments on curated music-image pairs show that MESA MIG outperforms caption-only and single-agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion regression performance compared with state…
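The pipeline described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names StructuredCaption, refine, and va_distance are hypothetical, the agents are plain text-to-text functions standing in for whatever models the paper uses, and Euclidean distance in the valence-arousal plane is only one plausible way to score VA alignment.

```python
from dataclasses import dataclass, replace
from math import hypot
from typing import Callable, Dict, Tuple

# The five aspects the abstract says the cooperating agents specialize in.
ASPECTS = ("scene", "motion", "style", "color", "composition")


@dataclass(frozen=True)
class StructuredCaption:
    """Hypothetical container for a music-derived caption, one field per aspect."""
    scene: str = ""
    motion: str = ""
    style: str = ""
    color: str = ""
    composition: str = ""

    def to_prompt(self) -> str:
        # Join the non-empty aspects into a single text-to-image prompt.
        parts = [getattr(self, aspect) for aspect in ASPECTS]
        return ", ".join(p for p in parts if p)


# Assumption: each agent maps the current text of its aspect to a refined text.
Agent = Callable[[str], str]


def refine(caption: StructuredCaption, agents: Dict[str, Agent]) -> StructuredCaption:
    """Let each available agent rewrite only its own aspect of the caption."""
    for aspect in ASPECTS:
        agent = agents.get(aspect)
        if agent is not None:
            caption = replace(caption, **{aspect: agent(getattr(caption, aspect))})
    return caption


def va_distance(music_va: Tuple[float, float], image_va: Tuple[float, float]) -> float:
    """Euclidean distance between (valence, arousal) pairs; smaller means closer.

    Assumption: the paper's actual alignment measure may differ.
    """
    return hypot(music_va[0] - image_va[0], music_va[1] - image_va[1])


# Usage with a toy color agent; real agents would be learned models.
draft = StructuredCaption(scene="a quiet lake", color="blue")
agents = {"color": lambda text: text + " with soft gold highlights"}
refined = refine(draft, agents)
print(refined.to_prompt())  # a quiet lake, blue with soft gold highlights
print(va_distance((0.5, 0.2), (0.2, -0.2)))
```

In this sketch each agent touches only its own field, which keeps the refinements independent; the VA distance would then compare the music head's prediction with the visual head's estimate for a generated image.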
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Emotion and Mood Recognition
