A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene
Wenbo Zhang, Yifan Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu

TL;DR
This paper introduces DC-CLIP, a lightweight multilingual vision-language model trained through a two-stage distillation and alignment process, achieving high performance in English and Chinese with reduced resource requirements.
Contribution
It presents a simple, effective framework for multilingual CLIP compression, enabling deployment on resource-constrained devices with competitive accuracy.
Findings
DC-CLIP outperforms existing models in zero-shot image classification.
The model achieves high accuracy with less training data.
The training framework effectively aligns visual and multilingual textual features.
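For readers unfamiliar with how zero-shot classification works in a CLIP-style model, the following is a minimal sketch, not DC-CLIP's actual code. It assumes the image and text encoders have already produced embedding vectors; the toy vectors, the function names, and the `logit_scale` value of 100 are illustrative assumptions, not values taken from the paper. The text embeddings could come from prompts in either Chinese or English, provided both are aligned to the same visual feature space.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_feature, text_features, labels, logit_scale=100.0):
    """Pick the label whose text embedding is most similar to the image embedding.

    logit_scale is a hypothetical temperature; real models learn or fix their own.
    """
    img = l2_normalize(image_feature)
    sims = []
    for text_feature in text_features:
        txt = l2_normalize(text_feature)
        # Cosine similarity is the dot product of two unit vectors.
        sims.append(sum(a * b for a, b in zip(img, txt)))
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [1.0, 0.0, 0.0],  # embedding of a prompt for "cat"
    [0.0, 1.0, 0.0],  # embedding of a prompt for "dog"
    [0.0, 0.0, 1.0],  # embedding of a prompt for "car"
]
labels = ["cat", "dog", "car"]

label, probs = zero_shot_classify(image_feature, text_features, labels)
print(label, [round(p, 4) for p in probs])
```

No classifier head is trained here: the label set is defined entirely by the text prompts, which is why the quality of the alignment between visual and textual features determines zero-shot accuracy.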
Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks. However, most of them are only applicable to the English context. Subsequent research has focused on this problem and proposed improved models, such as CN-CLIP and AltCLIP, to extend their applicability to Chinese and even other languages. Nevertheless, these models suffer from high latency and a large memory footprint at inference time, which limits their deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English contexts. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Semantic Web and Ontologies
Methods: AltCLIP · Contrastive Language-Image Pre-training
