Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan

TL;DR
This paper introduces ITIT, a cycle consistency-based training framework that enables vision-language models to learn effectively from unpaired image and text data, reducing reliance on costly paired datasets.
Contribution
ITIT is the first framework to leverage cycle consistency for training vision-language models on unpaired data, achieving competitive performance with significantly less paired data.
Findings
ITIT performs comparably to state-of-the-art models using only 3 million paired samples.
The model effectively learns bidirectional image-text generation from unpaired datasets.
Cycle consistency enforces meaningful alignment between images and texts during training.
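As a rough illustration of the cycle consistency idea named in the findings above, the sketch below maps a sample to the other modality and back, then scores how well the round trip reconstructs the original. It is a toy under stated assumptions, not the ITIT implementation: the "image" is a short list of floats, the two "generators" are plain scaling functions, and the names scale_map, mse, and cycle_loss are hypothetical helpers introduced only for this example.

```python
from typing import Callable, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def scale_map(factor: float) -> Mapping:
    """Return a toy 'generator' that multiplies every element by `factor`.

    Hypothetical stand-in for a learned image-to-text or text-to-image model.
    """
    return lambda v: [factor * x for x in v]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cycle_loss(sample: Vector, forward: Mapping, backward: Mapping) -> float:
    """Round-trip the sample (forward, then backward) and compare to the input.

    No paired target is needed: the sample itself is the supervision signal.
    """
    generated = forward(sample)            # e.g. image -> "text"
    reconstructed = backward(generated)    # e.g. "text" -> image
    return mse(sample, reconstructed)


# An unpaired "image" with no caption attached (toy data).
image = [1.0, 2.0, 3.0]

image_to_text = scale_map(2.0)
good_text_to_image = scale_map(0.5)   # exactly inverts image_to_text
bad_text_to_image = scale_map(1.0)    # does not invert image_to_text

consistent = cycle_loss(image, image_to_text, good_text_to_image)
inconsistent = cycle_loss(image, image_to_text, bad_text_to_image)

print(f"consistent pair of mappings:   {consistent:.4f}")
print(f"inconsistent pair of mappings: {inconsistent:.4f}")
```

The loss is zero only when the backward mapping undoes the forward one, which is why minimizing it on unpaired samples pushes the two generation directions toward agreement. In a real model the mappings would be learned networks and the comparison would likely happen on tokens or features rather than raw floats.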
Abstract
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
