SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models

Zheng Liu; Hao Liang; Bozhou Li; Wentao Xiong; Chong Chen; Conghui He; Wentao Zhang; Bin Cui

arXiv:2407.20756 · cs.CV · August 12, 2025 · 1 citation

TL;DR

SynthVLM generates high-quality image-caption datasets by using diffusion models to synthesize images from captions. Vision-language models trained on the resulting data outperform models trained on existing datasets across multiple benchmarks.

Contribution

The paper presents SynthVLM, a data synthesis approach, and SynthVLM-100K, a dataset built with it. Together they improve vision-language model training, with reported state-of-the-art results on VQA and MMLU benchmarks.

Findings

01. In both model and human evaluations, the SynthVLM-100K dataset outperforms traditional real-world datasets.

02. SynthVLM-based models achieve state-of-the-art (SOTA) results on VQA tasks.

03. Models trained with SynthVLM data outperform LLaVA while using less pretraining data.

Abstract

Vision-Language Models (VLMs) have recently emerged, demonstrating remarkable vision-understanding capabilities. However, training these models requires large-scale datasets, which brings challenges related to efficiency, effectiveness, and quality of web data. In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. Unlike traditional methods, where captions are generated from images, SynthVLM utilizes advanced diffusion models and high-quality captions to synthesize and select images from text captions, thereby creating precisely aligned image-text pairs. We further introduce SynthVLM-100K, a high-quality dataset consisting of 100K curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets. Leveraging this dataset, we develop a new family of…
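
The caption-to-image direction described in the abstract (synthesize images from captions, then select the well-aligned ones) can be illustrated with a minimal sketch. This is not the authors' implementation: `synthesize_and_select`, `toy_generate`, and `toy_score` are hypothetical names, and the generator and alignment scorer are passed in as parameters because this excerpt does not say which diffusion model or selection metric is used. The toy stand-ins only make the sketch runnable without any model weights.

```python
from typing import Any, Callable, List, Tuple


def synthesize_and_select(
    captions: List[str],
    generate_image: Callable[[str], Any],
    alignment_score: Callable[[Any, str], float],
    keep: int,
) -> List[Tuple[Any, str, float]]:
    """Generate one image per caption, score each pair, keep the best `keep`.

    `generate_image` stands in for a text-to-image model and
    `alignment_score` for an image-text alignment metric; both are
    assumptions supplied by the caller, not part of any real library.
    """
    scored = []
    for caption in captions:
        image = generate_image(caption)
        scored.append((image, caption, alignment_score(image, caption)))
    # Highest-scoring (best aligned) pairs first.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:keep]


def toy_generate(caption: str) -> frozenset:
    """Toy 'image': the set of longer words, so short words are 'lost'."""
    return frozenset(w for w in caption.lower().split() if len(w) > 3)


def toy_score(image: frozenset, caption: str) -> float:
    """Toy alignment: fraction of distinct caption words present in the image."""
    words = set(caption.lower().split())
    return len(image & words) / len(words) if words else 0.0


captions = [
    "a red bicycle leaning against a brick wall",
    "a cat on a mat",
    "two children flying kites",
]
selected = synthesize_and_select(captions, toy_generate, toy_score, keep=2)
for _, caption, score in selected:
    print(f"{score:.2f}  {caption}")
```

In a real pipeline the two callables would wrap a text-to-image model and a learned image-text similarity model. Keeping them as parameters means the selection logic can be tested independently of either.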

Code & Models

Repositories

starriver030515/synthvlm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections