uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee

TL;DR
uCLIP introduces a lightweight, parameter-efficient method for extending vision-language models to low-resource languages without requiring additional image-text data, significantly improving multilingual retrieval performance.
Contribution
The paper presents a novel, minimal-training framework that aligns multilingual representations using a small projection module and English anchors, enhancing performance in underrepresented languages.
Findings
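Significant improvements in retrieval accuracy on the Crossmodal-3600 (XM3600) benchmark for five underrepresented languages: Czech, Finnish, Croatian, Hungarian, and Romanian.
Effective multilingual alignment achieved without additional image-text or text-text pairs.
The approach is robust and data-efficient: only a 1.7M-parameter projection module is trained, while the image encoder and multilingual text encoder stay frozen.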
Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English…
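The abstract is cut off before it says what the contrastive loss is computed over, so the following is only a generic sketch, not the authors' method. It shows a symmetric InfoNCE-style loss applied to the output of a small linear projection over frozen embeddings, with a second set of embeddings standing in for the English anchors. The function names (`l2_normalize`, `project`, `info_nce`), the toy 2-dimensional vectors, the purely linear projection, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def project(W, v):
    """Apply a linear projection W (a list of rows) to vector v.
    Hypothetical stand-in for a small trainable projection module."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def info_nce(queries, anchors, temperature=0.07):
    """Symmetric InfoNCE: the i-th query should match the i-th anchor."""
    q = [l2_normalize(v) for v in queries]
    a = [l2_normalize(v) for v in anchors]
    n = len(q)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(x * y for x, y in zip(q[i], a[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy data: frozen embeddings and anchors (assumed values, not real encoder outputs).
frozen = [[1.0, 0.0], [0.0, 1.0]]
anchors = [[1.0, 0.0], [0.0, 1.0]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]  # projection that keeps pairs aligned
W_swap = [[0.0, 1.0], [1.0, 0.0]]      # projection that mismatches every pair

aligned_loss = info_nce([project(W_identity, v) for v in frozen], anchors)
misaligned_loss = info_nce([project(W_swap, v) for v in frozen], anchors)
```

In this toy setup the aligned projection yields a loss near zero and the swapped projection a much larger one. That gap is the signal a trainable projection would follow while the encoders themselves stay frozen.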
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
