Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach
Yuanxiang Huangfu, Chaochao Wang, Weilei Wang

TL;DR
Role-SynthCLIP introduces a role-playing prompt-based synthetic data generation method that enhances semantic diversity and image-text alignment, significantly improving CLIP model performance with fewer training pairs.
Contribution
The paper presents a novel role-playing prompt framework for generating diverse, high-quality synthetic image-caption pairs to improve CLIP training.
Findings
Achieves 64.1% R@1 on MS COCO with only 1 million pairs.
Outperforms existing synthetic data methods trained on larger datasets.
Enhances semantic diversity and caption quality in synthetic data.
Abstract
The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
