TL;DR
MEDiC is a multi-objective framework that distills from a frozen CLIP encoder by combining pixel reconstruction, patch-level token distillation, and global CLS alignment. The full combination reaches 73.9% kNN accuracy on ImageNet-1K, and the paper systematically explores the surrounding design choices.
Contribution
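The paper introduces MEDiC (Multi-objective Exploration of Distillation from CLIP), a distillation framework that combines three learning objectives in a single pipeline (patch-level token distillation, global CLS alignment, and pixel reconstruction) and investigates how they interact.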
Findings
All three objectives provide complementary information.
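Evolved masking based on hierarchical clustering with relative position bias does not outperform simple block masking, despite producing more semantically coherent masks.
Performance is highly sensitive to the loss weights: small changes away from the optimal values cause significant drops.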
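For readers unfamiliar with the baseline in the second finding, block masking can be sketched as follows. This is a generic illustration, not the paper's procedure: the function name `block_mask`, the block-size range, the mask ratio, and the trimming of the final block are all assumptions made for the example.

```python
import random


def block_mask(grid_h, grid_w, mask_ratio=0.5, min_block=2, max_block=4, seed=None):
    """Return a grid_h x grid_w list of booleans (True = masked patch).

    Hypothetical sketch: rectangular blocks are dropped at random positions
    until the requested fraction of patches is masked. Assumes
    0 <= mask_ratio <= 1 and min_block <= min(max_block, grid_h, grid_w).
    """
    rng = random.Random(seed)
    target = int(round(mask_ratio * grid_h * grid_w))
    mask = [[False] * grid_w for _ in range(grid_h)]
    masked = 0
    while masked < target:
        # Sample a block size and a top-left corner that keeps it in bounds.
        bh = rng.randint(min_block, min(max_block, grid_h))
        bw = rng.randint(min_block, min(max_block, grid_w))
        top = rng.randint(0, grid_h - bh)
        left = rng.randint(0, grid_w - bw)
        for r in range(top, top + bh):
            for c in range(left, left + bw):
                # The last block is trimmed so the count hits the target exactly.
                if not mask[r][c] and masked < target:
                    mask[r][c] = True
                    masked += 1
    return mask


# Example: a 14 x 14 patch grid (e.g. a 224-pixel image with 16-pixel patches).
example = block_mask(14, 14, mask_ratio=0.5, seed=0)
print(sum(cell for row in example for cell in row))  # number of masked patches
```

Because masked patches form contiguous rectangles, the model cannot fill them in by copying from immediate neighbours, which is the usual motivation for block masking over independent per-patch masking.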
Abstract
Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior…
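A minimal sketch of how three such objectives could be combined into one training loss. The abstract does not specify the distance functions, which positions each term covers, or the weight values, so everything here is an assumption for illustration: cosine distance for the two feature terms, mean squared error for pixels, restriction of the patch and pixel terms to masked positions, and the names `multi_objective_loss`, `w_patch`, `w_cls`, and `w_pixel`.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity along the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a_n * b_n, axis=-1)


def multi_objective_loss(student_tokens, teacher_tokens, student_cls, teacher_cls,
                         pred_pixels, target_pixels, mask,
                         w_patch=1.0, w_cls=1.0, w_pixel=1.0):
    """Weighted sum of three terms (hypothetical formulation).

    Shapes: tokens (B, N, D), cls (B, D), pixels (B, N, P), mask (B, N).
    Assumes at least one position is masked.
    """
    mask = mask.astype(bool)
    # Patch-level token distillation against the frozen teacher (assumed: masked positions only).
    patch_loss = cosine_distance(student_tokens[mask], teacher_tokens[mask]).mean()
    # Global CLS alignment.
    cls_loss = cosine_distance(student_cls, teacher_cls).mean()
    # Pixel reconstruction from the decoder output (assumed: MSE on masked patches).
    pixel_loss = np.mean((pred_pixels[mask] - target_pixels[mask]) ** 2)
    total = w_patch * patch_loss + w_cls * cls_loss + w_pixel * pixel_loss
    return total, {"patch": patch_loss, "cls": cls_loss, "pixel": pixel_loss}


# Toy usage with random arrays standing in for model outputs.
rng = np.random.default_rng(0)
B, N, D, P = 2, 16, 8, 12
mask = rng.random((B, N)) < 0.5
total, parts = multi_objective_loss(
    rng.normal(size=(B, N, D)), rng.normal(size=(B, N, D)),
    rng.normal(size=(B, D)), rng.normal(size=(B, D)),
    rng.normal(size=(B, N, P)), rng.normal(size=(B, N, P)),
    mask, w_patch=1.0, w_cls=0.5, w_pixel=0.1,
)
print(total, parts)
```

Returning the individual terms alongside the total makes it easier to see which objective dominates when the weights are changed, which matters given the reported sensitivity to loss weights.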
