Audio-to-Image Cross-Modal Generation

Maciej Żelaszczyk; Jacek Mańdziuk

arXiv:2109.13354 · cs.MM · August 16, 2022

Open Access

TL;DR

This paper explores cross-modal generative modeling by training variational autoencoders (VAEs) to reconstruct image archetypes from audio data. It finds a trade-off between the diversity and the consistency of the generated images, while the images retain the features needed for classification.

Contribution

The paper demonstrates the feasibility of audio-to-image generation using VAEs within an adversarial training framework and highlights the trade-off between diversity and consistency in the generated images.

Findings

01. Trade-off between image diversity and consistency, controlled by scaling the reconstruction loss

02. Generated images retain critical features for classification despite diversity

03. Adversarial training enhances variability in cross-modal generation

Abstract

Cross-modal representation learning makes it possible to integrate information from different modalities into one representation. At the same time, research on generative models tends to focus on the visual domain with less emphasis on other domains, such as audio or text, potentially missing the benefits of shared representations. Studies successfully linking more than one modality in the generative setting are rare. In this context, we verify the possibility of training variational autoencoders (VAEs) to reconstruct image archetypes from audio data. Specifically, we consider VAEs in an adversarial training framework in order to ensure more variability in the generated data and find that there is a trade-off between the consistency and diversity of the generated images. This trade-off can be governed by scaling the reconstruction loss up or down, respectively. Our results further suggest that even…
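
The abstract says the consistency/diversity trade-off is governed by scaling the reconstruction loss. The exact objective is not given on this page, so the following is only a minimal sketch under stated assumptions: a squared-error reconstruction term, the standard Gaussian KL term of a VAE, and a non-saturating adversarial term. The function names, the `recon_scale` parameter, and the `disc_prob` input are hypothetical and are not taken from the paper.

```python
import math

def gaussian_kl(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def squared_error(target, output):
    """Sum of squared differences between target and generated pixels."""
    return sum((t - o) ** 2 for t, o in zip(target, output))

def generator_loss(target, output, mu, log_var, disc_prob, recon_scale):
    """Hypothetical combined objective (an assumption, not the paper's loss).

    disc_prob is a discriminator's estimated probability that the
    generated image is real; recon_scale weights the reconstruction term.
    """
    recon = squared_error(target, output)
    kl = gaussian_kl(mu, log_var)
    adv = -math.log(disc_prob)  # non-saturating generator term
    return recon_scale * recon + kl + adv

# Same inputs, two scalings of the reconstruction term.
target, output = [1.0, 0.0], [0.0, 0.0]
mu, log_var = [0.0, 0.0], [0.0, 0.0]
high = generator_loss(target, output, mu, log_var, disc_prob=0.5, recon_scale=2.0)
low = generator_loss(target, output, mu, log_var, disc_prob=0.5, recon_scale=0.5)
```

In a sketch like this, a larger `recon_scale` makes the reconstruction term dominate, which would push outputs toward the target archetype (consistency); a smaller value leaves more weight on the adversarial term, which is the direction the abstract associates with diversity.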

Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Generative Adversarial Networks and Image Synthesis