Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data
Niclas Popp, Jan Hendrik Metzen, Matthias Hein

TL;DR
This paper trains smaller image encoders on synthetic data with an image feature-based L2 distillation loss. The resulting encoders support zero-shot classification with significantly fewer parameters and avoid the poor synthetic-to-real generalization observed with contrastive losses.
Contribution
It introduces an L2 distillation approach that improves zero-shot generalization of compact image encoders trained on synthetic data, outperforming contrastive loss methods.
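As a rough illustration of the two loss families contrasted here, the sketch below computes a feature-based L2 distillation loss and an InfoNCE-style contrastive loss between student and teacher image features. It assumes both feature sets share the same dimension and are L2-normalized before comparison; the function names, the temperature value, and the InfoNCE form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def l2_distillation_loss(student_feats, teacher_feats):
    """Mean squared Euclidean distance between normalized feature pairs.

    Assumption: features are normalized first; the paper may differ.
    Each sample is matched only to its own teacher feature.
    """
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean(np.sum((s - t) ** 2, axis=-1)))


def contrastive_distillation_loss(student_feats, teacher_feats, temperature=0.07):
    """InfoNCE-style loss (hypothetical stand-in for a contrastive objective).

    Each student feature must pick out its own teacher feature among all
    teacher features in the batch, so the loss depends on the other samples.
    """
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    logits = s @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy data: the "student" features are a noisy copy of the "teacher" features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
student = teacher + 0.1 * rng.normal(size=(8, 16))

print("L2 loss:", l2_distillation_loss(student, teacher))
print("Contrastive loss:", contrastive_distillation_loss(student, teacher))
```

The L2 loss is computed per sample, while the contrastive loss depends on which other samples share the batch. That batch dependence is one plausible route by which a contrastive objective could reward features that merely separate synthetic images from one another.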
Findings
Achieves zero-shot performance comparable to larger models on multiple datasets.
Reduces model size by up to 92% while maintaining accuracy.
Identifies spurious feature exploitation as a key challenge in synthetic data distillation.
Abstract
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using the image feature-based L2 distillation loss, we mitigate…
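To make concrete why a smaller image encoder alone suffices for zero-shot classification, the sketch below follows the standard CLIP-style recipe: class-name text embeddings are computed once and cached, and at inference each image feature is assigned to the class with the highest cosine similarity. The function name and the toy vectors are hypothetical; in practice the text embeddings would come from the teacher's text encoder and the image features from the distilled student.

```python
import numpy as np


def zero_shot_predict(image_feats, class_text_feats):
    """Return the index of the most similar class embedding for each image.

    image_feats:      (num_images, dim) features from the image encoder
    class_text_feats: (num_classes, dim) cached text embeddings, one per class
    """
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = class_text_feats / np.linalg.norm(class_text_feats, axis=-1, keepdims=True)
    similarities = img @ txt.T  # cosine similarity, shape (num_images, num_classes)
    return similarities.argmax(axis=-1)


# Toy example: three classes whose text embeddings are the unit axes.
class_text_feats = np.eye(3)
image_feats = np.array([
    [0.9, 0.1, 0.0],  # closest to class 0
    [0.0, 0.2, 0.8],  # closest to class 2
])
predictions = zero_shot_predict(image_feats, class_text_feats)
print(predictions)
```

Because the text embeddings are fixed after a single forward pass, only the image encoder runs per input, which is why shrinking it alone reduces inference cost.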
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Contrastive Language-Image Pre-training · Focus
