MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts

Huaye Zhang; Chenglizhao Chen; Mengke Song; Tingting Chen; Diqiong Jiang; Lichun Liu; Xinyu Liu

PMC · DOI: 10.3390/s26041395 · February 23, 2026

Open Access

TL;DR

This paper introduces MEMA, a model that evaluates how well music matches the visual and emotional tone of a video.

Contribution

The paper contributes MEMA, a model trained in two stages, and VMAE-Sets, a dataset, for evaluating the aesthetic synergy between music and visuals.

Findings

01

MEMA achieves an 18.137% improvement in LCC (linear correlation coefficient) and a 17.866% improvement in SRCC (Spearman rank correlation coefficient) over existing methods.

02

The model demonstrates superior alignment of audio and visual narratives.

03

VMAE-Sets is introduced as the first large-scale dataset for soundtrack aesthetic evaluation.
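The two metrics named in the findings, LCC and SRCC, measure how closely predicted scores track human ratings: LCC captures linear agreement, while SRCC captures agreement in rank order only. The following is a minimal sketch of how they are commonly computed, not the authors' evaluation code. The function names and the sample scores are hypothetical and chosen purely for illustration.

```python
import math


def pearson_lcc(pred, target):
    """Linear (Pearson) correlation coefficient between two score lists."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson_lcc(average_ranks(pred), average_ranks(target))


# Hypothetical scores, NOT taken from the paper or from VMAE-Sets.
predicted = [3.1, 4.0, 2.2, 4.8, 3.6]
human = [3.0, 4.2, 2.5, 4.5, 3.9]

lcc = pearson_lcc(predicted, human)
srcc = spearman_srcc(predicted, human)
print(f"LCC = {lcc:.3f}, SRCC = {srcc:.3f}")
```

In this toy example the predicted and human scores have the same ordering, so SRCC is 1 even though LCC is slightly below 1. This is why the two metrics are usually reported together.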

Abstract

Recent technologies such as music retrieval, soundtrack generation, and video understanding have developed rapidly. Consequently, the aesthetic evaluation of video soundtracks has become an important research topic in academia. Soundtracks are key elements in shaping the emotional atmosphere and driving the narrative rhythm. Therefore, they require systematic methods to assess their artistic coordination with visual content. However, existing approaches mostly focus on evaluating the quality of the music itself. They often lack the ability to model the deeper aesthetic synergy between audio and visuals. To address this gap, we propose MEMA, a new soundtrack aesthetic evaluation model. MEMA employs a two-stage training strategy. The first stage builds a crossmodal imagination mechanism using a Conditional Variational Autoencoder. This method achieves bidirectional semantic reconstruction…

Taxonomy

Topics: Music and Audio Processing · Visual Attention and Saliency Detection · Aesthetic Perception and Analysis