Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data
Leonardo A. Fanzeres, Climent Nadeu

TL;DR
This study explores unsupervised sound-to-image translation on diverse audiovisual data. It uses conditional GANs to generate images from unknown sounds and informativity classifiers to evaluate them, with more than 14% of the generated images being interpretable and semantically coherent.
Contribution
The paper introduces an unsupervised approach to sound-to-image translation on diverse data, using conditional GANs for generation and informativity classifiers for evaluation, whereas previous studies relied on data with low content diversity and/or sound class supervision.
Findings
More than 14% of the images generated from unknown sounds were interpretable and semantically coherent.
Demonstrated a trade-off between informativity and pixel-space convergence.
Generalized the model to handle diverse, complex audiovisual data.
Abstract
The motivation of our research is to explore the possibilities of automatic sound-to-image (S2I) translation for enabling a human receiver to visually infer the occurrence of sound-related events. We expect the computer to 'imagine' the scene from the captured sound, generating original images that picture the sound-emitting source. Previous studies on similar topics opted for simplified approaches using data with low content diversity and/or sound class supervision. In contrast, we propose to perform unsupervised S2I translation using thousands of distinct and unknown scenes, with slightly pre-cleaned data, just enough to guarantee aural-visual semantic coherence. To that end, we employ conditional generative adversarial networks (GANs) with a deep densely connected generator. Additionally, we present a solution using informativity classifiers to perform quantitative evaluation of the…
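As a rough illustration of how a figure such as "over 14% interpretable" could be derived from classifier outputs, the sketch below counts the share of generated images whose informativity score reaches a threshold. This is an assumption for illustration only: the function name `informative_fraction`, the per-image score format, and the 0.5 threshold are hypothetical, and the paper's actual informativity classifiers and decision criteria are not described in this summary.

```python
from typing import Sequence


def informative_fraction(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Return the share of images whose score is at or above the threshold.

    `scores` is assumed to hold one informativity score per generated image,
    in [0, 1]. Both the score format and the threshold are hypothetical.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    informative = sum(1 for s in scores if s >= threshold)
    return informative / len(scores)


if __name__ == "__main__":
    # Made-up scores for illustration; these are not results from the paper.
    example_scores = [0.91, 0.12, 0.40, 0.07, 0.66, 0.30, 0.05]
    fraction = informative_fraction(example_scores)
    print(f"{fraction:.1%} of images counted as informative")
```

A fixed threshold is the simplest possible choice here; any real evaluation would depend on how the classifiers are trained and calibrated.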
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
