Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sahand Sharifzadeh; Christos Kaplanis; Shreya Pathak; Dharshan Kumaran; Anastasija Ilic; Jovana Mitrovic; Charles Blundell; Andrea Banino

arXiv:2403.07750·cs.CV·June 10, 2024·3 cites

TL;DR

This paper introduces Synth$^2$, a method that uses synthetic captions and image embeddings generated by LLMs and image models to train visual-language models efficiently, reducing reliance on human-labeled datasets.

Contribution

Synth$^2$ leverages synthetic data creation via LLMs and image generators to improve VLM training, enabling faster and less data-dependent model development.

Findings

01

VLMs finetuned on synthetic data achieve performance comparable to models trained on human-labeled data

02

Synthesizing in the image embedding space is 25% faster than in pixel space

03

Semantic diversity in captions enhances downstream performance

Abstract

The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Although the text-to-image model and the VLM are initially trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated…
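The data flow described in the abstract (an LLM writes captions, a text-to-image model maps each caption to an image embedding, and the resulting pairs feed VLM finetuning) can be sketched roughly as follows. This is a toy illustration only: `generate_captions`, `caption_to_image_embedding`, and `build_synthetic_pairs` are hypothetical stand-ins invented for this sketch, the templates and the hash-based embedding are placeholders, and none of it reflects the authors' actual models or code.

```python
import hashlib
import random

# NOTE: every function here is a hypothetical stand-in, not the paper's implementation.


def generate_captions(classes, n_per_class, seed=0):
    """Stand-in for the LLM: fills simple templates with class names."""
    rng = random.Random(seed)
    templates = [
        "a photo of a {} on a table",
        "a {} next to a window",
        "two people looking at a {}",
    ]
    return [rng.choice(templates).format(c) for c in classes for _ in range(n_per_class)]


def caption_to_image_embedding(caption, dim=8):
    """Stand-in for the text-to-image model.

    Returns a deterministic pseudo-embedding derived from a hash of the caption.
    The point of the sketch is that the output is an embedding, with no pixels decoded.
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_synthetic_pairs(classes, n_per_class=2, dim=8):
    """Builds (image_embedding, caption) pairs that a VLM could be finetuned on."""
    captions = generate_captions(classes, n_per_class)
    return [(caption_to_image_embedding(c, dim), c) for c in captions]


pairs = build_synthetic_pairs(["dog", "bicycle"])
for embedding, caption in pairs:
    print(caption, [round(x, 2) for x in embedding])
```

In a real system the two stand-ins would be replaced by a pretrained LLM and a pretrained text-to-image model, and the pairs would be consumed by the VLM's training loop in place of encoded real images.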

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training