MEMA: Multimodal Aesthetic Evaluation of Music in Visual Contexts
Huaye Zhang, Chenglizhao Chen, Mengke Song, Tingting Chen, Diqiong Jiang, Lichun Liu, Xinyu Liu

TL;DR
This paper introduces MEMA, a model that evaluates how well music matches the visual and emotional tone of a video.
Contribution
The paper contributes MEMA, a model trained in two stages, and VMAE-Sets, a dataset for evaluating the aesthetic synergy between music and visuals.
Findings
MEMA improves the linear correlation coefficient (LCC) by 18.137% and the Spearman rank correlation coefficient (SRCC) by 17.866% over existing methods.
The model demonstrates superior alignment of audio and visual narratives.
VMAE-Sets is introduced as the first large-scale dataset for soundtrack aesthetic evaluation.
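The LCC and SRCC figures above are correlation scores between a model's predicted ratings and human ratings. The sketch below shows how such scores are typically computed, assuming LCC means the Pearson linear correlation and SRCC the Spearman rank correlation; this page does not define either term. The function names and the sample scores are hypothetical and are not taken from the paper or from VMAE-Sets.

```python
import math


def pearson_lcc(pred, target):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson_lcc(average_ranks(pred), average_ranks(target))


# Hypothetical example: model scores vs. human aesthetic ratings.
predicted = [0.2, 0.4, 0.5, 0.9]
human = [1.0, 2.0, 2.5, 5.0]
lcc = pearson_lcc(predicted, human)
srcc = spearman_srcc(predicted, human)
```

LCC rewards predictions that track human ratings linearly, while SRCC only checks that the ordering agrees, which is why evaluation papers often report both.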
Abstract
Recent technologies such as music retrieval, soundtrack generation, and video understanding have developed rapidly. Consequently, the aesthetic evaluation of video soundtracks has become an important research topic in academia. Soundtracks are key elements in shaping the emotional atmosphere and driving the narrative rhythm. Therefore, they require systematic methods to assess their artistic coordination with visual content. However, existing approaches mostly focus on evaluating the quality of the music itself. They often lack the ability to model the deeper aesthetic synergy between audio and visuals. To address this gap, we propose MEMA, a new soundtrack aesthetic evaluation model. MEMA employs a two-stage training strategy. The first stage builds a crossmodal imagination mechanism using a Conditional Variational Autoencoder. This method achieves bidirectional semantic reconstruction…
Taxonomy
Topics: Music and Audio Processing · Visual Attention and Saliency Detection · Aesthetic Perception and Analysis
