Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation
HaeChun Chung, JooYong Shim, Jong-Kook Kim

TL;DR
This paper introduces Cross-Modal Contrastive Representation Learning (CMCRL), a method for audio-to-image generation that uses contrastive learning to extract useful audio features and thereby improve the quality of the generated images.
Contribution
The paper proposes a new contrastive learning framework for audio-to-image generation, enhancing feature extraction and image quality over previous methods.
Findings
CMCRL improves image quality in audio-to-image generation.
In the reported experiments, CMCRL outperforms previous approaches.
Contrastive learning effectively extracts useful features from audio data.
Abstract
Representing the same information in multiple modalities offers a variety of perspectives on it, which can improve understanding. Generating data in a different modality from existing data may therefore be crucial for enhancing that understanding. In this paper, we investigate the cross-modal audio-to-image generation problem and propose Cross-Modal Contrastive Representation Learning (CMCRL), which extracts useful features from audio and uses them in the generation phase. Experimental results show that CMCRL generates higher-quality images than previous research.
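To make the idea of cross-modal contrastive learning concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between paired audio and image embeddings. The abstract does not state the exact objective used by CMCRL, so the loss form, the temperature value, and all function names here are illustrative assumptions rather than the authors' implementation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchor_embs, target_embs, temperature=0.1):
    # Hypothetical helper: mean InfoNCE loss where (anchor_embs[i], target_embs[i])
    # is the positive pair and every other target in the batch is a negative.
    a = [l2_normalize(v) for v in anchor_embs]
    b = [l2_normalize(v) for v in target_embs]
    losses = []
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def symmetric_contrastive_loss(audio_embs, image_embs, temperature=0.1):
    # Average the audio-to-image and image-to-audio directions (an assumption;
    # the source does not say whether CMCRL uses a symmetric objective).
    return 0.5 * (info_nce(audio_embs, image_embs, temperature)
                  + info_nce(image_embs, audio_embs, temperature))

# Toy batch: two audio embeddings and two candidate sets of image embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # each image matches its audio
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = symmetric_contrastive_loss(audio, aligned_images)
loss_swapped = symmetric_contrastive_loss(audio, swapped_images)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is lower when each audio embedding sits closest to its paired image embedding, which is the property a contrastive audio encoder would be trained to achieve before its features are passed to a generator.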
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
