Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou

TL;DR
This paper introduces Chinese CLIP, a family of Chinese vision-language models pretrained on a newly constructed large-scale image-text dataset with a two-stage pretraining method. The models achieve state-of-the-art results on MUGE, Flickr30K-CN, and COCO-CN.
Contribution
The paper constructs a large-scale Chinese image-text dataset, drawn mostly from publicly available datasets, and develops 5 Chinese CLIP models (77 to 958 million parameters) with a two-stage pretraining approach.
Findings
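Chinese CLIP achieves state-of-the-art results on MUGE, Flickr30K-CN, and COCO-CN in both zero-shot and finetuning setups.
The models perform competitively in zero-shot image classification on ELEVATER.
Two-stage pretraining (image encoder frozen first, then all parameters optimized) enhances model performance.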
Abstract
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive…
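The two-stage schedule and the contrastive objective described in the abstract can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the group names `image_encoder` and `text_encoder`, the function names, and the temperature value of 0.07 are assumptions made for this example, and which other components (such as projection layers) are trained in stage 1 is not stated in the abstract.

```python
import math


def trainable_groups(stage):
    """Return which encoders receive gradient updates in each stage.

    Stage 1 keeps the image encoder frozen; stage 2 optimizes everything.
    The group names are hypothetical labels, not the authors' identifiers.
    """
    if stage == 1:
        return {"image_encoder": False, "text_encoder": True}
    if stage == 2:
        return {"image_encoder": True, "text_encoder": True}
    raise ValueError("stage must be 1 or 2")


def _normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of matched pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same pair.
    The temperature of 0.07 is an assumed value for illustration.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _cross_entropy_diagonal(logits)
    text_to_image = _cross_entropy_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    for stage in (1, 2):
        print("stage", stage, trainable_groups(stage))
    pairs = [[1.0, 0.0], [0.0, 1.0]]
    print("loss on aligned pairs:", symmetric_contrastive_loss(pairs, pairs))
```

In a real training loop the flags from `trainable_groups` would decide which parameters are passed to the optimizer, while the same loss is used in both stages.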
Code & Models
- 🤗 OFA-Sys/chinese-clip-vit-base-patch16 · model · 183k dl · ♡ 125
- 🤗 OFA-Sys/chinese-clip-rn50 · model · ♡ 6
- 🤗 OFA-Sys/chinese-clip-vit-large-patch14 · model · 5.2k dl · ♡ 36
- 🤗 OFA-Sys/chinese-clip-vit-large-patch14-336px · model · 555 dl · ♡ 26
- 🤗 OFA-Sys/chinese-clip-vit-huge-patch14 · model · 691 dl · ♡ 30
- 🤗 qihoo360/BDM1.0 · model · 14 dl
- 🤗 Chien0405/distillation_train_cn_clip · model
- 🤗 gongting/chinese-clip-vit-large-patch14-336px · model
- 🤗 Dimple-sun1/chinese-clip-vit-base-patch16 · model · 12 dl
- 🤗 richardzhengmedgemma/chinese-clip-rn50 · model
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
