CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

Yuchen Huang; Zhiyuan Fan; Zhitao He; Sandeep Polisetty; Wenyan Li; Yi R. Fung

arXiv:2507.06210 · cs.CV · July 17, 2025

TL;DR

This paper introduces CultureCLIP, a culturally aware vision-language model trained on a synthetic dataset to better recognize nuanced cultural differences while maintaining generalization.

Contribution

We create CulTwin, a synthetic cultural dataset, and fine-tune CLIP to improve its ability to distinguish subtle cultural concepts using contextualized captions and synthetic images.

Findings

01. Up to 5.49% improvement in fine-grained cultural concept recognition

02. Outperforms base CLIP on culture-specific benchmarks

03. Preserves generalization ability of CLIP

Abstract

Pretrained vision-language models (VLMs) such as CLIP excel in general multimodal comprehension but often struggle to capture nuanced, context-dependent visual cues. This makes it difficult to distinguish between similar-looking concepts with potentially different cultural meanings. Such deficiencies are mainly due to a limited amount of high-quality cultural data, contextual information, and the lack of negative examples that highlight subtle differences. To mitigate this, we design a data curation pipeline leveraging open-sourced VLMs and text-to-image models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but are culturally different. Then, we fine-tune CLIP on CulTwin to develop CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic…
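
The abstract describes fine-tuning CLIP on pairs of visually similar but culturally different concepts. This page does not spell out the training objective, so the following is only a minimal sketch of the standard CLIP-style symmetric contrastive loss, under the assumption that both members of a twin pair are placed in the same batch so that each one acts as a hard negative for the other. The function names (`cosine`, `cross_entropy`, `twin_contrastive_loss`), the temperature value, and the toy embeddings are all illustrative and are not taken from the paper or its repository.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one row, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def twin_contrastive_loss(
    image_embs: Sequence[Vector],
    text_embs: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, commonly used with CLIP
) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] is the positive match for text_embs[i]. Assumption:
    twin concepts sit in the same batch, so the twin's caption and image
    appear among the negatives of each sample.
    """
    n = len(image_embs)
    # Temperature-scaled similarity matrix: rows are images, columns are captions.
    sim = [
        [cosine(image_embs[i], text_embs[j]) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Each image should pick its own caption out of the batch.
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Each caption should pick its own image out of the batch.
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy twin pair: two concepts with 2-d embeddings (made-up values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]   # each caption matches its image
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]   # captions confused with the twin

print(twin_contrastive_loss(images, aligned_captions))
print(twin_contrastive_loss(images, swapped_captions))
```

In this toy setting the loss is close to zero when each image aligns with its own caption and becomes large when the model confuses a concept with its twin, which is the kind of mistake that in-batch twin negatives are meant to penalize.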

Code & Models

Repositories

lukahhcm/cultureclip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion · Balanced Selection · Contrastive Language-Image Pre-training