Semantically Invariant Text-to-Image Generation
Shagan Sah, Dheeraj Peri, Ameya Shringi, Chi Zhang, Miguel Dominguez, Andreas Savakis, Ray Ptucha

TL;DR
This paper introduces MMVR, a bidirectional multi-modal network for improved text-to-image and image-to-text generation, utilizing semantic similarity and n-gram metrics to enhance image quality and modality integration.
Contribution
The paper presents MMVR, a novel architecture enabling bidirectional image-text generation, with two key improvements: a semantic-aware cost function and the use of multiple similar sentences.
Findings
MMVR improves upon existing text-conditioned image generation results by over 20%.
The model effectively integrates visual and textual modalities.
Semantic similarity enhances image quality.
Abstract
Image captioning has demonstrated models that are capable of generating plausible text given input images or videos. Further, recent work in image generation has shown significant improvements in image quality when text is used as a prior. Our work ties these concepts together by creating an architecture that can enable bidirectional generation of images and text. We call this network Multi-Modal Vector Representation (MMVR). Along with MMVR, we propose two improvements to text-conditioned image generation. First, an n-gram metric-based cost function is introduced that generalizes the caption with respect to the image. Second, multiple semantically similar sentences are shown to help in generating better images. Qualitative and quantitative evaluations demonstrate that MMVR improves upon existing text-conditioned image generation results by over 20%, while integrating visual and…
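The abstract does not spell out the n-gram metric used in the cost function. As a rough illustration only, the sketch below scores a generated caption against several reference captions using a BLEU-style clipped n-gram precision and turns that score into a cost. The function names `ngrams`, `ngram_precision`, and `ngram_cost` are hypothetical, and the choice of clipped precision is an assumption, not the paper's formulation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against references.

    Assumption: whitespace tokenization and lowercasing; a real system
    would likely use a proper tokenizer.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0

    # For each n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)

    # Clip candidate counts so repeated n-grams are not over-rewarded.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def ngram_cost(candidate, references, n=1):
    """Hypothetical cost: lower when the caption overlaps the references more."""
    return 1.0 - ngram_precision(candidate, references, n)


if __name__ == "__main__":
    refs = ["a dog runs fast", "the dog is running"]
    print(ngram_cost("a dog runs", refs, n=1))
    print(ngram_cost("a cat sleeps", refs, n=1))
```

Scoring against several references at once loosely mirrors the abstract's second idea, that multiple semantically similar sentences help: a caption is rewarded for matching any of the paraphrases rather than a single fixed wording.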
