Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

Yuanxiang Huangfu; Chaochao Wang; Weilei Wang

arXiv:2511.05057 · cs.CV · November 10, 2025

TL;DR

Role-SynthCLIP introduces a role-playing prompt-based synthetic data generation method that enhances semantic diversity and image-text alignment, significantly improving CLIP model performance with fewer training pairs.

Contribution

The paper presents a novel role-playing prompt framework for generating diverse, high-quality synthetic image-caption pairs to improve CLIP training.

Findings

1. Achieves 64.1% R@1 on MS COCO with only 1 million pairs.
2. Outperforms existing synthetic data methods trained on larger datasets.
3. Enhances semantic diversity and caption quality in synthetic data.
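
For readers unfamiliar with the R@1 figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes a simplified one-to-one setting in which each query has exactly one correct candidate; the summary does not state the retrieval direction or the paper's evaluation protocol, and MS COCO pairs each image with several captions, so this is not the paper's exact procedure. The function name `recall_at_k` and the toy similarity scores are illustrative only.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ground_truth: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose correct candidate is among the k top-scoring ones.

    similarity[i][j] is the score between query i and candidate j.
    ground_truth[i] is the index of the single correct candidate for query i
    (a simplifying assumption; real benchmarks may have several).
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy data: 3 queries scored against 3 candidates (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.2],  # correct candidate 0 ranks first
    [0.3, 0.2, 0.8],  # correct candidate 1 ranks last
    [0.1, 0.2, 0.7],  # correct candidate 2 ranks first
]
toy_ground_truth = [0, 1, 2]
toy_r1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
```

In this toy case two of the three queries retrieve their correct candidate at rank 1, so `toy_r1` is 2/3. A reported R@1 of 64.1% means the same ratio, computed over the benchmark's full query set.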

Abstract

The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods primarily focus on increasing data volume, an emphasis that often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged.…
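
The role-playing idea in the abstract can be illustrated with a short sketch. Everything here beyond the two role names is an assumption: the prompt wording, the `synthesize_pairs` helper, the round-robin assignment of one role per image (one possible way to keep the pair count equal to the image count), and the `stub_mllm` placeholder that stands in for a real MLLM call. The abstract is truncated and does not describe the authors' actual prompts or pipeline.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Role names come from the two examples in the abstract.
# The instruction text for each role is illustrative, not the paper's prompt.
ROLE_PROMPTS: Dict[str, str] = {
    "compositional analyst": (
        "You are a compositional analyst. Describe the objects in the image, "
        "their attributes, and how they are arranged."
    ),
    "interpreter of image context": (
        "You are an interpreter of image context. Describe the setting, "
        "the activity, and the situation the image likely depicts."
    ),
}


def synthesize_pairs(
    image_ids: Sequence[str],
    caption_fn: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Return one (image_id, role, caption) triple per image.

    Roles are assigned round-robin (an assumption), so the number of
    image-text pairs equals the number of images.
    """
    roles = list(ROLE_PROMPTS)
    pairs = []
    for i, image_id in enumerate(image_ids):
        role = roles[i % len(roles)]
        caption = caption_fn(image_id, ROLE_PROMPTS[role])
        pairs.append((image_id, role, caption))
    return pairs


def stub_mllm(image_id: str, prompt: str) -> str:
    """Hypothetical stand-in for an MLLM call; returns a tagged placeholder."""
    return f"[caption for {image_id} under prompt: {prompt[:32]}...]"


demo_pairs = synthesize_pairs(["img_001", "img_002", "img_003"], stub_mllm)
```

Swapping `stub_mllm` for a function that sends the image and the role prompt to an actual model would yield captions written from different viewpoints while the dataset size stays fixed.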

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques