MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Suhwan Choi; Kyu Won Kim; Myungjoo Kang

arXiv:2501.01094·cs.SD·November 21, 2025

TL;DR

This paper presents MMVA, a framework for multimodal emotional content matching across images, music, and captions using valence and arousal, supported by an expanded dataset, achieving state-of-the-art results.

Contribution

Introduces MMVA, a novel tri-modal encoder framework that uses valence and arousal to match emotional content across images, music, and musical captions, along with an expanded dataset, IMEMNet-C.

Findings

01. Achieves state-of-the-art valence-arousal prediction performance.

02. Effective in zero-shot multimodal matching tasks.

03. Demonstrates the utility of continuous valence-arousal scores for training.

Abstract

We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on continuous valence (emotional positivity) and arousal (emotional intensity) values. Because the matching score is continuous, image-music pairs can be sampled at random during training, with the similarity score for each pair computed from the valence-arousal values of the two modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks,…
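The abstract does not give the formula for the matching score, so the following is only a minimal sketch of the general idea: turning two sets of valence-arousal annotations into a continuous similarity target for randomly sampled image-music pairs. The function name `va_similarity`, the Euclidean-distance form, and the assumption that valence and arousal are scaled to [0, 1] are all illustrative choices, not the authors' definition.

```python
import numpy as np


def va_similarity(va_a, va_b, max_dist=None):
    """Hypothetical continuous matching score from valence-arousal pairs.

    va_a: array-like of shape (n, 2), (valence, arousal) per item.
    va_b: array-like of shape (m, 2), (valence, arousal) per item.
    Returns an (n, m) array; 1.0 means identical valence-arousal values.
    """
    va_a = np.asarray(va_a, dtype=float)
    va_b = np.asarray(va_b, dtype=float)
    # Pairwise Euclidean distance between every item in va_a and va_b.
    diff = va_a[:, None, :] - va_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    if max_dist is None:
        # Assumption: both axes lie in [0, 1], so the largest possible
        # distance is the diagonal of the unit square.
        max_dist = np.sqrt(2.0)
    return 1.0 - dist / max_dist


# Illustrative annotations (made-up values, not taken from IMEMNet-C).
image_va = np.array([[0.9, 0.8], [0.1, 0.2]])
music_va = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.5]])

# Any image can be paired with any music clip, so pairs can be drawn
# at random and still receive a graded training target.
rng = np.random.default_rng(0)
img_idx = rng.integers(0, len(image_va), size=4)
mus_idx = rng.integers(0, len(music_va), size=4)
scores = va_similarity(image_va, music_va)
targets = scores[img_idx, mus_idx]
```

Under this sketch, no curated list of positive pairs is needed: every sampled pair carries a target between 0 and 1, which is one way to read the abstract's point about random sampling with continuous scores.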

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing