Large-scale Bilingual Language-Image Contrastive Learning
Byungsoo Ko, Geonmo Gu

TL;DR
This paper presents KELIP, a large-scale bilingual Korean-English multimodal model trained on 1.1 billion image-text pairs, demonstrating effective training schemes and exploring cultural and semantic aspects of multimodal learning.
Contribution
The work introduces a large-scale bilingual multimodal dataset and a model with simple training schemes, revealing insights into cross-lingual and cultural semantic learning.
Findings
Simple training schemes such as MAE pre-training and multi-crop augmentation yield competitive performance in both Korean and English.
Multimodal models can learn cross-lingual relations without explicit cross-lingual training.
KELIP captures cultural differences in visual semantics.
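To make the "language-image contrastive learning" in the title concrete, the sketch below shows a symmetric contrastive (InfoNCE) loss of the kind popularized by CLIP. This summary does not spell out KELIP's exact objective, so treat this as an illustration of the general technique rather than the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image InfoNCE loss.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two image embeddings and their paired text embeddings.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In a bilingual setup of the kind described here, the text side of each pair could be either a Korean or an English caption. The loss only requires that each text embedding be matched with the image it was collected with, which is one plausible reading of how cross-lingual relations could emerge without explicit cross-lingual supervision.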
Abstract
This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While many multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, such machine-translated texts are limited in their ability to describe unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Masked autoencoder
