Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino

TL;DR
This paper introduces Synth$^2$, a method that uses synthetic captions and image embeddings generated by LLMs and image models to train visual-language models efficiently, reducing reliance on human-labeled datasets.
Contribution
Synth$^2$ leverages synthetic data creation via LLMs and image generators to improve VLM training, enabling faster and less data-dependent model development.
Findings
Synthetic data achieves comparable performance to human-labeled data
Synthesizing images in the embedding space is 25% faster than synthesizing them in pixel space
Semantic diversity in captions enhances downstream performance
Abstract
The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated…
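The data flow described in the abstract (an LLM writes captions, a text-to-image model turns each caption into an image embedding, and the resulting pairs are used to train the VLM) can be illustrated with a toy sketch. This is not the authors' implementation: `generate_captions`, `caption_to_image_embedding`, and `build_synthetic_pairs` are hypothetical names, the "LLM" is a template filler, and the "image generator" is a hash-seeded random vector, chosen only so the example runs without any model weights.

```python
import hashlib
import random


def generate_captions(concepts, per_concept=2, seed=0):
    """Hypothetical stand-in for the LLM captioner: fills fixed templates."""
    rng = random.Random(seed)
    templates = ["a photo of a {}", "a {} in a park", "a close-up of a {}"]
    captions = []
    for concept in concepts:
        # Pick distinct templates so each concept gets varied captions.
        for template in rng.sample(templates, per_concept):
            captions.append(template.format(concept))
    return captions


def caption_to_image_embedding(caption, dim=8):
    """Hypothetical stand-in for the text-to-image model.

    Maps a caption directly to a fixed-size embedding, never producing pixels.
    The vector is pseudo-random but deterministic for a given caption.
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def build_synthetic_pairs(concepts, per_concept=2, dim=8):
    """Assemble (image_embedding, caption) pairs for downstream VLM training."""
    captions = generate_captions(concepts, per_concept)
    return [(caption_to_image_embedding(c, dim), c) for c in captions]


pairs = build_synthetic_pairs(["dog", "bicycle"])
for embedding, caption in pairs:
    print(caption, len(embedding))
```

In a real setup the two stand-ins would be replaced by an actual language model and an actual text-to-image generator; the point of the sketch is only that the VLM consumes embedding-caption pairs, so no image has to be rendered to pixels and re-encoded.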
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
