A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Wenbo Zhang; Yifan Zhang; Jianfeng Lin; Binqiang Huang; Jinlu Zhang; Wenhao Yu

arXiv:2404.11249 · cs.CV · April 18, 2024 · 1 citation

Open Access

TL;DR

This paper introduces DC-CLIP, a lightweight multilingual vision-language model trained through a two-stage distillation and alignment process, achieving high performance in English and Chinese with reduced resource requirements.

Contribution

It presents a simple, effective framework for multilingual CLIP compression, enabling deployment on resource-constrained devices with competitive accuracy.

Findings

01. DC-CLIP outperforms existing models in zero-shot image classification.

02. The model achieves high accuracy with less training data.

03. The training framework effectively aligns visual and multilingual textual features.

Abstract

Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks. However, most of them are applicable only to English contexts. Subsequent research has addressed this problem with improved models, such as CN-CLIP and AltCLIP, that extend their applicability to Chinese and other languages. Nevertheless, these models suffer from high latency and a large memory footprint at inference time, which limits their deployment on resource-constrained edge devices. In this work, we propose a conceptually simple yet effective multilingual CLIP compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English contexts. In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Semantic Web and Ontologies

Methods: AltCLIP · Contrastive Language-Image Pre-training