Enhancing CLIP Conceptual Embedding through Knowledge Distillation
Kuei-Chun Kao

TL;DR
This paper introduces Knowledge-CLIP, a novel method that enhances CLIP's multi-modal embedding capabilities by applying knowledge distillation from Llama 2, improving both text and image representations through specialized training objectives.
Contribution
The paper proposes a knowledge distillation framework for CLIP that uses Llama 2 as the teacher model and combines three objectives: text embedding distillation, concept learning via clustering, and contrastive learning.
Findings
Improved performance of text and image encoders.
Effective integration of Llama 2 knowledge into CLIP.
Enhanced multi-modal embedding quality.
Abstract
Recently, CLIP has become an important model for aligning images and text in multi-modal contexts. However, researchers have identified limitations in the ability of CLIP's text and image encoders to extract detailed knowledge from pairs of captions and images. In response, this paper presents Knowledge-CLIP, an innovative approach designed to improve CLIP's performance by integrating a new knowledge distillation (KD) method based on Llama 2. Our approach focuses on three key objectives: Text Embedding Distillation, Concept Learning, and Contrastive Learning. First, Text Embedding Distillation involves training the Knowledge-CLIP text encoder to mirror the teacher model, Llama 2. Next, Concept Learning assigns a soft concept label to each caption-image pair by employing offline K-means clustering on text data from Llama 2, enabling Knowledge-CLIP to learn from these soft concept labels.…
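The first two objectives named in the abstract can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation. The specific loss (mean squared error on L2-normalized embeddings), the softmax over negative squared distances used to form soft concept labels, the `temperature` parameter, and every function name below are assumptions chosen for illustration. The abstract states only that the text encoder is trained to mirror Llama 2 and that soft concept labels come from offline K-means clustering on Llama 2 text data.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (leave an all-zero vector unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def distillation_loss(student, teacher):
    """Assumed distillation loss: MSE between normalized student and teacher embeddings."""
    total = 0.0
    for s, t in zip(student, teacher):
        s, t = l2_normalize(s), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
    return total / len(student)


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means, run offline on teacher text embeddings."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def soft_concept_label(embedding, centroids, temperature=1.0):
    """Assumed soft label: softmax over negative squared distances to the centroids."""
    logits = [-sq_dist(embedding, c) / temperature for c in centroids]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
```

In this sketch, a training step would add `distillation_loss` on the text encoder to a `cross_entropy` term between each pair's soft concept label and the student's predicted concept distribution. How the paper weights these terms, and how it formulates the contrastive objective, is not recoverable from the truncated abstract.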
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: LLaMA · k-Means Clustering · Contrastive Learning · Knowledge Distillation · Contrastive Language-Image Pre-training
