Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

TL;DR
This paper shows that a simple post-processing step, feature distillation, substantially improves the fine-tuning performance of several pre-training methods. With it, contrastive learning becomes competitive with masked image modeling (MIM), and the method sets new state-of-the-art results.
Contribution
The authors introduce a feature distillation technique that improves the optimization friendliness of representations, boosting fine-tuning performance across multiple pre-training approaches.
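To make the idea of "converting old representations into new ones" concrete, the sketch below computes a per-token feature-distillation loss. A frozen teacher (the original pre-trained model) provides target features, and a student is trained to regress them. The specific choices here are assumptions for illustration, not a statement of the authors' exact recipe: teacher features are whitened with a parameter-free layer norm, the student output passes through a hypothetical linear projection head (`project`), and the regression loss is smooth L1. Features are plain Python lists so the example has no dependencies.

```python
import math
from typing import List

Vector = List[float]


def whiten(feature: Vector, eps: float = 1e-6) -> Vector:
    """Parameter-free layer norm: zero mean, unit variance per token, no affine weights."""
    mean = sum(feature) / len(feature)
    var = sum((x - mean) ** 2 for x in feature) / len(feature)
    return [(x - mean) / math.sqrt(var + eps) for x in feature]


def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def project(feature: Vector, weight: List[Vector]) -> Vector:
    """Hypothetical linear projection head on the student: out[i] = sum_j W[i][j] * f[j]."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weight]


def feature_distillation_loss(
    student_tokens: List[Vector],
    teacher_tokens: List[Vector],
    proj_weight: List[Vector],
    beta: float = 1.0,
) -> float:
    """Mean smooth-L1 error between projected student features and whitened teacher features."""
    total, count = 0.0, 0
    for s, t in zip(student_tokens, teacher_tokens):
        target = whiten(t)  # the teacher is frozen; only its output serves as the target
        pred = project(s, proj_weight)
        for p, y in zip(pred, target):
            total += smooth_l1(p, y, beta)
            count += 1
    return total / count


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    teacher = [[1.0, 2.0, 3.0]]
    # A student that already outputs the whitened teacher features incurs zero loss.
    print(feature_distillation_loss([whiten(teacher[0])], teacher, identity))
```

Whitening the teacher targets puts every teacher's features on a common scale, so the same loss and learning rate can be reused whichever pre-training method produced the teacher. This is one plausible motivation for the design under the assumptions above.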
Findings
Contrastive learning methods become competitive with MIM after feature distillation.
CLIP ViT-L reaches 89.0% top-1 accuracy on ImageNet-1K after feature distillation.
SwinV2-G sets state-of-the-art results on ADE20K semantic segmentation and COCO object detection.
Abstract
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performance, overshadowing previously prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing step in the form of feature distillation (FD). Feature distillation converts the old representations into new representations that have several desirable properties, similar to those of representations produced by MIM. These properties, which we collectively refer to as optimization friendliness, are identified and analyzed with a set of attention- and optimization-related diagnostic tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive…
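The abstract mentions attention-related diagnostic tools without listing them here. One widely used attention diagnostic for Vision Transformers is the average attention distance: for each query patch, the attention-weighted mean spatial distance to the key patches it attends to. Whether and how the authors use exactly this measure is an assumption in this sketch. The function name `average_attention_distance`, the row-major patch layout, and the default 16-pixel patch size are all illustrative choices.

```python
import math
from typing import List


def average_attention_distance(
    attn: List[List[float]], grid_w: int, patch_size: int = 16
) -> float:
    """Average attention distance for one attention head, in pixels.

    attn[q][k] is the attention weight from query patch q to key patch k
    (each row is assumed to sum to 1). Patches are indexed row-major on a
    grid that is grid_w patches wide.
    """
    n = len(attn)
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_w)
        dist = 0.0
        for k, w in enumerate(attn[q]):
            ky, kx = divmod(k, grid_w)
            dist += w * math.hypot(qx - kx, qy - ky)  # distance in patch units
        total += dist
    return patch_size * total / n


if __name__ == "__main__":
    n = 4  # a 2x2 patch grid
    local = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
    uniform = [[1.0 / n] * n for _ in range(n)]
    print(average_attention_distance(local, grid_w=2))    # purely local head
    print(average_attention_distance(uniform, grid_w=2))  # purely global head
```

A head that attends only to its own patch has distance 0, while a head that spreads attention uniformly has a large distance. Comparing the distribution of this value across heads and layers is one way to characterize how "local" or "global" a representation's attention is.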
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Masked Image Modeling (MIM)
