Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning
Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou

TL;DR
This paper introduces CLL-CLIP, a continual learning approach for multilingual vision-language models that incrementally expands language capabilities without catastrophic forgetting, demonstrated on a new 36-language benchmark.
Contribution
It proposes a novel continual language learning method for VL models, with an expandable token embedding layer and regularization techniques to prevent forgetting.
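The idea of an expandable token embedding layer can be illustrated with a small sketch. This is a toy, standard-library-only illustration and not the CLL-CLIP implementation: the class name `ExpandableTokenEmbedding`, the Gaussian initialization of new rows, and the example tokens are all assumptions made for the example. It shows only the structural point, namely that adding tokens for a new language appends rows while leaving the rows learned for earlier languages untouched.

```python
import random


class ExpandableTokenEmbedding:
    """Toy embedding table that grows when a new language adds tokens.

    Hypothetical illustration only; not the authors' code.
    """

    def __init__(self, vocab, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.token_to_id = {}
        self.weights = []  # one row (list of floats) per token
        self.expand(vocab)

    def expand(self, new_tokens):
        """Append a row for each unseen token; existing rows are not modified."""
        added = 0
        for tok in new_tokens:
            if tok in self.token_to_id:
                continue  # token already known from an earlier language
            self.token_to_id[tok] = len(self.weights)
            # Assumed initialization: small Gaussian noise.
            row = [self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
            self.weights.append(row)
            added += 1
        return added

    def lookup(self, tokens):
        """Return the embedding row for each token, in order."""
        return [self.weights[self.token_to_id[t]] for t in tokens]


# Start with a tiny "English" vocabulary, then add a second language.
emb = ExpandableTokenEmbedding(["a", "cat"], dim=4)
old_rows = [row[:] for row in emb.weights]
num_added = emb.expand(["cat", "gato", "un"])  # "cat" is shared, so 2 new rows
```

In a real model the new rows would then be trained, with some regularization to limit drift in what was learned before; that training step is outside the scope of this sketch.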
Findings
CLL-CLIP improves multilingual image-text retrieval performance.
The approach boosts state-of-the-art methods by up to 6.7% in Recall@1.
The authors construct a 36-language benchmark for evaluation.
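For readers unfamiliar with the metric above, Recall@1 in retrieval is the fraction of queries whose top-ranked candidate is the correct match. The sketch below is a minimal illustration under simplifying assumptions: the function name `recall_at_k` and the toy score matrix are hypothetical, and each query is assumed to have exactly one correct candidate at the same index, which real benchmarks may not satisfy.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 images (made-up numbers).
scores = [
    [0.9, 0.1, 0.2],  # query 0: best match is image 0 (correct)
    [0.3, 0.2, 0.8],  # query 1: best match is image 2 (incorrect)
    [0.1, 0.4, 0.7],  # query 2: best match is image 2 (correct)
]
r1 = recall_at_k(scores, k=1)  # 2 of 3 queries are hits
```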
Abstract
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery in a few languages like English restricts their applicability in broader communities. To this end, there is an increasing interest in developing multilingual VL models via a joint-learning setup, which, however, could be unrealistic due to expensive costs and data availability. In this work, we propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It solely trains token…
Taxonomy
Topics: Second Language Learning and Teaching · EFL/ESL Teaching and Learning
Methods: Contrastive Language-Image Pre-training
