uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Dahyun Chung; Donghyun Shin; Yujin Sung; Seunggi Moon; Jinwoo Jeon; Byung-Jun Lee

arXiv:2511.13036 · cs.CV · December 9, 2025

TL;DR

uCLIP is a lightweight, parameter-efficient method for extending vision-language models to low-resource languages. It requires no additional image-text data and significantly improves multilingual retrieval performance.

Contribution

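The paper presents a minimal-training framework that aligns multilingual text representations with a pretrained vision-language model. Only a small projection module is trained, using English anchors, while the pretrained image encoder and multilingual text encoder stay frozen. The result is better retrieval performance in underrepresented languages.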

Findings

01. Retrieval accuracy improves significantly for five underrepresented languages (Czech, Finnish, Croatian, Hungarian, and Romanian).

02. Multilingual alignment is achieved without additional image-text or text-text pairs.

03. The approach is robust and data-efficient: only a 1.7M-parameter projection module is trained.

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English…
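
The training signal described in the abstract can be illustrated with a minimal sketch of a symmetric contrastive (InfoNCE-style) loss applied to the output of a small projection. This is not the authors' implementation. The abstract is truncated before it specifies the loss inputs, so the pairing used here (projected text embeddings against anchor embeddings of the same English sentences) is an assumption for illustration. The names `project`, `info_nce`, `W_proj`, the single linear layer, the toy dimensions, and the temperature of 0.07 are likewise hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(W, v):
    """Apply a linear projection W (one row per output dimension) to v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def info_nce(queries, keys, temperature=0.07):
    """Symmetric InfoNCE loss: queries[i] is the positive match for keys[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    n = len(q)
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
              for qi in q]

    def diagonal_cross_entropy(mat):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


random.seed(0)
# Hypothetical stand-ins for frozen-encoder outputs on four English sentences.
text_emb = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
# Hypothetical projection weights; in training, only these would be updated.
W_proj = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
projected = [project(W_proj, v) for v in text_emb]

# Toy anchors: identical to the projected embeddings, i.e. perfect alignment.
anchors = [list(v) for v in projected]
loss_aligned = info_nce(projected, anchors)
# Reversing the anchors pairs every sentence with the wrong anchor.
loss_mismatched = info_nce(projected, anchors[::-1])
print(f"aligned: {loss_aligned:.4f}  mismatched: {loss_mismatched:.4f}")
```

Correctly paired embeddings give a lower loss than the same embeddings paired in the wrong order, which is the gradient signal a projection module would be trained on. In a real setup the two encoders would stay frozen and an optimizer would update only the projection parameters.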

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques