GMAIL: Generative Modality Alignment for generated Image Learning
Shentong Mo, Sukmin Yun

TL;DR
GMAIL introduces a novel framework that treats generated images as a separate modality from real images, aligning them in a shared latent space to improve vision-language task performance.
Contribution
The paper proposes a multi-modal learning approach that explicitly aligns generated images with real images in latent space, enhancing the utility of synthetic data for vision-language tasks.
Findings
The authors report significant improvements on image captioning and retrieval tasks.
Aligning the generated-image and real-image modalities improves performance across multiple vision-language benchmarks.
Performance scales positively as the volume of generated training data increases.
Abstract
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then…
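To make the idea of a cross-modality alignment loss concrete, here is a minimal sketch in Python. The abstract above does not specify the form of the loss, so this example assumes a symmetric contrastive (InfoNCE-style) objective between embeddings of generated images and embeddings of their paired real images. The function names, the temperature value, and the pairing scheme are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def alignment_loss(gen_emb, real_emb, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings.

    ASSUMPTION: row i of gen_emb and row i of real_emb form a matched pair
    (for example, a generated image and the real image sharing its caption).
    This is a generic InfoNCE-style objective, not the loss from the paper.
    """
    g = l2_normalize(gen_emb)
    r = l2_normalize(real_emb)
    logits = g @ r.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(len(g))
    # Generated -> real: each generated embedding should pick out its real pair.
    loss_g2r = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Real -> generated: the same requirement in the other direction.
    loss_r2g = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_g2r + loss_r2g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(8, 16))
    print("matched pairs:   ", alignment_loss(real, real))
    print("mismatched pairs:", alignment_loss(real[::-1], real))
```

Under these assumptions, the loss is small when each generated-image embedding sits closest to its own real counterpart and large when the pairing is scrambled, which is the behavior a shared latent space for the two modalities would need.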
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
