Conditional generation of multi-modal data using constrained embedding space mapping
Subhajit Chaudhury, Sakyasingha Dasgupta, Asim Munawar, Md. A. Salam Khan, Ryuki Tachibana

TL;DR
This paper introduces a conditional generative model that maps embeddings of multi-modal data into a shared latent space, enabling cross-modal generation and inference. It is demonstrated by generating colored double MNIST digits from text and speech inputs.
Contribution
The paper proposes a novel constrained embedding space mapping approach with separate latent spaces for each modality, allowing flexible conditional inference and improved multi-modal data generation.
Findings
Successfully learned joint representations for multi-modal data.
Generalized to generate colored double MNIST digits from text and speech.
Demonstrated effective cross-modal concept learning.
Abstract
We present a conditional generative model that maps low-dimensional embeddings of multiple modalities of data to a common latent space, thereby extracting semantic relationships between them. The embedding specific to each modality is first extracted, and a constrained optimization procedure is then performed to project the two embedding spaces onto a common manifold. The individual embeddings are generated back from this common latent space. However, to enable independent conditional inference, in which the corresponding embeddings are extracted separately from the common latent space representation, we deploy a proxy variable trick, wherein the single shared latent space is replaced by the respective separate latent spaces of each modality. We design an objective function such that during training we can force these separate spaces to lie close to each other by minimizing the…
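The objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear encoders and decoders, the parameter names (`enc_a`, `dec_b`, and so on), the helper functions, and the `weight` argument are all hypothetical. Because the abstract is truncated before it names the quantity being minimized, the closeness term here is a plain squared Euclidean distance, used only as a stand-in.

```python
import numpy as np


def init_params(dim_a, dim_b, dim_z, seed=0):
    """Random linear encoder/decoder weights for two modalities (hypothetical)."""
    rng = np.random.default_rng(seed)
    return {
        "enc_a": rng.normal(scale=0.1, size=(dim_a, dim_z)),
        "enc_b": rng.normal(scale=0.1, size=(dim_b, dim_z)),
        "dec_a": rng.normal(scale=0.1, size=(dim_z, dim_a)),
        "dec_b": rng.normal(scale=0.1, size=(dim_z, dim_b)),
    }


def proxy_latent_loss(emb_a, emb_b, params, weight=1.0):
    """Per-modality reconstruction loss plus a penalty pulling the two latents together."""
    # Each modality gets its own latent code instead of one shared code.
    z_a = emb_a @ params["enc_a"]
    z_b = emb_b @ params["enc_b"]
    # Each latent is decoded back into its own embedding space.
    rec_a = z_a @ params["dec_a"]
    rec_b = z_b @ params["dec_b"]
    recon = np.mean((rec_a - emb_a) ** 2) + np.mean((rec_b - emb_b) ** 2)
    # ASSUMPTION: squared Euclidean distance as a stand-in closeness measure;
    # the source text is cut off before it states the actual quantity.
    closeness = np.mean(np.sum((z_a - z_b) ** 2, axis=1))
    return recon + weight * closeness, recon, closeness


def cross_modal_generate(emb_a, params):
    """Conditional inference: encode modality A, decode with modality B's decoder."""
    return (emb_a @ params["enc_a"]) @ params["dec_b"]


params = init_params(dim_a=8, dim_b=6, dim_z=3)
rng = np.random.default_rng(1)
text_emb = rng.normal(size=(5, 8))   # stand-in for text embeddings
image_emb = rng.normal(size=(5, 6))  # stand-in for image embeddings
total, recon, closeness = proxy_latent_loss(text_emb, image_emb, params, weight=2.0)
generated = cross_modal_generate(text_emb, params)
print(generated.shape)
```

Once the two latent spaces lie close together, a code produced by one modality's encoder can be passed to the other modality's decoder, which is the conditional inference path that `cross_modal_generate` sketches.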
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Human Motion and Animation
