Large-scale Bilingual Language-Image Contrastive Learning

Byungsoo Ko; Geonmo Gu

arXiv:2203.14463 · cs.CV · April 18, 2022 · 6 cites

TL;DR

This paper presents KELIP, a large-scale bilingual Korean-English multimodal model trained on 1.1 billion image-text pairs, demonstrating effective training schemes and exploring cultural and semantic aspects of multimodal learning.

Contribution

The work introduces a large-scale bilingual multimodal dataset and a model with simple training schemes, revealing insights into cross-lingual and cultural semantic learning.

Findings

1. Training schemes such as MAE pre-training and multi-crop augmentation improve performance.
2. Multimodal models can learn cross-lingual relations without explicit cross-lingual training.
3. KELIP captures cultural differences in visual semantics.
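As a rough illustration of what multi-crop augmentation involves, the sketch below samples a few large "global" crop boxes and several small "local" crop boxes from one image. It is a generic sketch of the technique, not the configuration used for KELIP: the function names, the number of crops, and the scale ranges are all hypothetical values chosen for the example.

```python
import random


def sample_crop_box(width, height, scale_range, rng):
    """Sample one crop box covering a random fraction of the image area.

    The crop keeps the image's aspect ratio (an assumption made for simplicity).
    Returns (left, top, right, bottom) in pixel coordinates.
    """
    area_fraction = rng.uniform(*scale_range)
    side_scale = area_fraction ** 0.5  # scale each side so the area scales by area_fraction
    crop_w = max(1, int(width * side_scale))
    crop_h = max(1, int(height * side_scale))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def multi_crop_boxes(width, height, num_global=2, num_local=4,
                     global_scale=(0.4, 1.0), local_scale=(0.05, 0.4), seed=None):
    """Return (global_boxes, local_boxes) for one image.

    All defaults are hypothetical; the paper's actual settings are not stated here.
    """
    rng = random.Random(seed)
    global_boxes = [sample_crop_box(width, height, global_scale, rng)
                    for _ in range(num_global)]
    local_boxes = [sample_crop_box(width, height, local_scale, rng)
                   for _ in range(num_local)]
    return global_boxes, local_boxes


if __name__ == "__main__":
    global_boxes, local_boxes = multi_crop_boxes(224, 224, seed=0)
    print("global crops:", global_boxes)
    print("local crops:", local_boxes)
```

In a training pipeline, each box would typically be cropped out, resized to a fixed resolution, and paired with the same caption, so that one image-text pair yields several training views.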

Abstract

This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While many multimodal datasets focus on English, and multilingual multimodal research uses machine-translated texts, such machine-translated texts are limited in their ability to describe unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can…
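For readers unfamiliar with language-image contrastive learning, the following is a minimal sketch of a symmetric image-text contrastive loss in the style popularized by CLIP. It assumes that the i-th image and the i-th text in a batch form a matching pair and that every other combination is a negative. The function names and the temperature value of 0.07 are assumptions for illustration; this is not taken from the KELIP code, and the paper's exact objective may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    temperature=0.07 is an assumed value for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    image_embs = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    mismatched_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched loss:", contrastive_loss(image_embs, matched_texts))
    print("mismatched loss:", contrastive_loss(image_embs, mismatched_texts))
```

The loss is low when each image embedding is closest to its own caption and high when captions are shuffled. In a bilingual setting, Korean and English captions would simply be embedded by the text encoder and fed through the same loss, which is one plausible reading of how cross-lingual relations can emerge without explicit cross-lingual supervision.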

Code & Models

Repositories

navervision/kelip (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Masked autoencoder