Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang

TL;DR
This paper introduces Knowledge-CLIP, a pre-training framework that incorporates semantic knowledge graphs into vision-language models to improve semantic alignment and reasoning across modalities.
Contribution
It presents a novel knowledge-based pre-training approach that enhances CLIP by injecting semantic information from knowledge graphs, improving cross-modal understanding.
Findings
Outperforms original CLIP on multiple vision-language tasks.
Enhances semantic alignment and reasoning capabilities.
Demonstrates effectiveness of knowledge integration in pre-training.
Abstract
Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. By introducing knowledge-based objectives in the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities. Extensive experiments on…
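For context on what the abstract builds on, here is a minimal sketch of the symmetric contrastive objective that the original CLIP model is generally trained with. It is not the Knowledge-CLIP method: the knowledge-based objectives are not specified in this summary, so none are shown. The function names, the temperature value of 0.07, and the toy embeddings are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against one target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Assumes image_embs[k] and text_embs[k] form the k-th matched pair.
    The temperature default is an illustrative choice, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text: each row should put its mass on the matching text.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n

    # Text-to-image: same computation on the transposed similarity matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy embeddings (hypothetical): two perfectly aligned pairs versus two swapped pairs.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the aligned batch yields a much smaller loss than the swapped batch, which is the behaviour the contrastive objective rewards. A knowledge-graph-based objective would presumably add further terms on top of this, but their form is not given here.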
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training · ALIGN
