Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data

Niclas Popp; Jan Hendrik Metzen; Matthias Hein

arXiv:2404.16637·cs.CV·April 26, 2024

Open Access

TL;DR

This paper trains smaller image encoders on synthetic data with an L2 feature distillation loss. The resulting encoders support zero-shot classification with significantly fewer parameters and avoid the poor synthetic-to-real generalization observed with contrastive losses.

Contribution

The paper uses an image feature-based L2 distillation loss that improves the zero-shot generalization of compact image encoders trained on synthetic data, outperforming contrastive losses.
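
As a rough illustration of what an image feature-based L2 distillation loss can look like, the sketch below computes the mean squared Euclidean distance between student and teacher image features for a batch. The function name, the use of the squared distance, the batch averaging, and the absence of feature normalization are all assumptions made for this example; the summary above does not specify the exact formulation used in the paper.

```python
import numpy as np


def l2_feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between student and teacher image features.

    Hypothetical helper for illustration only. Both inputs are arrays of
    shape (batch, dim) holding one feature vector per image.
    """
    student_feats = np.asarray(student_feats, dtype=np.float64)
    teacher_feats = np.asarray(teacher_feats, dtype=np.float64)
    if student_feats.shape != teacher_feats.shape:
        raise ValueError("student and teacher features must have the same shape")
    # Squared Euclidean distance per image, then averaged over the batch
    # (assumption: the paper may scale or normalize differently).
    per_image = np.sum((student_feats - teacher_feats) ** 2, axis=-1)
    return float(per_image.mean())
```

Unlike a contrastive objective, this loss compares each student feature only with the teacher feature of the same image, so it does not depend on the other images in the batch.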

Findings

01. Achieves zero-shot performance comparable to larger models on multiple datasets.

02. Reduces model size by up to 92% while maintaining accuracy.

03. Identifies spurious feature exploitation as a key challenge in synthetic data distillation.

Abstract

Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using the image feature-based L2 distillation loss, we mitigate…
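
The claim that a smaller image encoder suffices for zero-shot classification can be sketched as follows: in CLIP-style zero-shot classification, the class text embeddings are computed once, so only the image encoder runs per image at inference time. The sketch below assumes the image features come from some student encoder and the class text features from a fixed text encoder, both mapped into the same embedding space. The function names and the use of cosine similarity with an argmax are illustrative choices, not the paper's code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length (eps guards against zero rows)."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def zero_shot_predict(image_feats, class_text_feats):
    """Return the index of the most similar class for each image.

    image_feats:      (num_images, dim) features from a (hypothetical) student
                      image encoder.
    class_text_feats: (num_classes, dim) precomputed text embeddings, one per
                      class prompt, assumed to share the embedding space.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(class_text_feats)
    # Cosine similarity between every image and every class prompt.
    sims = img @ txt.T
    return sims.argmax(axis=-1)
```

For example, `zero_shot_predict(image_feats, class_text_feats)` returns one class index per image. Because `class_text_feats` can be cached, shrinking only the image encoder reduces the per-image inference cost.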

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Contrastive Language-Image Pre-training · Focus