TL;DR
This paper introduces a self-supervised deep learning method that learns cross-modal embeddings for image-to-text and text-to-image synthesis, reducing reliance on labeled data and improving generation quality.
Contribution
It proposes a novel self-supervised approach using dense embeddings and GANs to learn cross-modal representations for both image and text generation tasks.
Findings
Successfully generates textual descriptions from images.
Generates images from textual descriptions.
Learns meaningful cross-modal embeddings without supervised labels.
Abstract
A comprehensive understanding of vision and language and their interrelation is crucial to realize the underlying similarities and differences between these modalities and to learn more generalized, meaningful representations. In recent years, most works on Text-to-Image synthesis and Image-to-Text generation have focused on supervised generative deep architectures, with very little interest placed on learning the similarities between the embedding spaces across modalities. In this paper, we propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces for both image-to-text and text-to-image generation. In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model and also dense vector representations at the sentence level utilizing LSTM based…
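To make the sentence-level step in the abstract more concrete, here is a minimal sketch of how an LSTM can fold a sequence of word vectors into one dense sentence vector, using the sigmoid and tanh activations listed under Taxonomy. It is a generic, untrained LSTM cell in pure Python, not the paper's model: the class name `TinyLSTMEncoder`, the dimensions, the random initialization, and the toy word vectors are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic function, used for the input, forget, and output gates."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyLSTMEncoder:
    """Hypothetical, untrained LSTM that maps word vectors to one sentence vector."""

    GATES = ("i", "f", "o", "g")  # input, forget, output, candidate

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim  # each gate sees [x_t ; h_{t-1}]
        self.W = {
            g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(hidden_dim)]
            for g in self.GATES
        }
        self.b = {g: [0.0] * hidden_dim for g in self.GATES}

    def _gate(self, name, z, activation):
        pre = matvec(self.W[name], z)
        return [activation(p + b) for p, b in zip(pre, self.b[name])]

    def step(self, x, h, c):
        """One LSTM time step: returns the new hidden and cell states."""
        z = list(x) + list(h)
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        g = self._gate("g", z, math.tanh)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def encode(self, word_vectors):
        """Run the cell over the sentence; the final hidden state is the embedding."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in word_vectors:
            h, c = self.step(x, h, c)
        return h


# Toy usage: a two-word "sentence" with made-up 3-dimensional word vectors.
encoder = TinyLSTMEncoder(input_dim=3, hidden_dim=4, seed=0)
sentence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]
sentence_embedding = encoder.encode(sentence)
```

In a trained system the weights would be learned, for example by an autoencoding objective, and the resulting sentence vector would live in the text embedding space that is then related to the image embedding space. The summary above does not specify those details, so they are left out here.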
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
