TL;DR
This paper presents SynVAE, a model that translates visual art into music by learning a consistent cross-modal latent space without paired datasets. Human evaluators matched the generated music to its source image with accuracies of up to 73%.
Contribution
The research introduces SynVAE, a novel variational autoencoder that enables cross-modal translation between images and music without requiring paired training data.
Findings
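SynVAE retains sufficient information content during translation, as measured on MNIST and the Behance Artistic Media dataset (BAM).
Human evaluators matched musical samples to the images that generated them with accuracies of up to 73%.
The model maintains cross-modal latent space consistency on both datasets.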
Abstract
The Synesthetic Variational Autoencoder (SynVAE) introduced in this research is able to learn a consistent mapping between visual and auditive sensory modalities in the absence of paired datasets. A quantitative evaluation on MNIST as well as the Behance Artistic Media dataset (BAM) shows that SynVAE is capable of retaining sufficient information content during the translation while maintaining cross-modal latent space consistency. In a qualitative evaluation trial, human evaluators were furthermore able to match musical samples with the images which generated them with accuracies of up to 73%.
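One way to read "cross-modal latent space consistency" is as a round-trip check: encode an image to a latent vector, decode that vector to music, re-encode the music, and compare the two latent vectors. The sketch below illustrates that reading only. The functions `image_encoder`, `music_decoder`, and `music_encoder` are hypothetical toy stand-ins, not SynVAE's networks, and the Euclidean distance is an assumed metric rather than the paper's documented one.

```python
import math


def image_encoder(pixels):
    """Hypothetical stand-in: compress an image to a 2-d latent (mean of each half)."""
    half = len(pixels) // 2
    return [sum(pixels[:half]) / half, sum(pixels[half:]) / (len(pixels) - half)]


def music_decoder(z):
    """Hypothetical stand-in: turn a 2-d latent into a three-value 'note' sequence."""
    return [z[0], z[1], z[0] + z[1]]


def music_encoder(notes):
    """Hypothetical stand-in: recover a 2-d latent from the note sequence."""
    return [notes[0], notes[2] - notes[0]]


def latent_distance(z_a, z_b):
    """Euclidean distance between two latent vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_a, z_b)))


def round_trip_distance(pixels):
    """Image -> latent -> music -> latent; a small distance means a consistent mapping."""
    z_image = image_encoder(pixels)
    notes = music_decoder(z_image)
    z_music = music_encoder(notes)
    return latent_distance(z_image, z_music)


if __name__ == "__main__":
    print(round_trip_distance([0.0, 0.5, 1.0, 0.5]))
```

In this toy setup the music encoder exactly inverts the decoder, so the round-trip distance is zero. A learned model would only approximate this, and the size of the remaining gap is what a consistency metric of this kind would report.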
Taxonomy
Methods
