Accessing Vision Foundation Models via ImageNet-1K
Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu

TL;DR
This paper introduces Proteus, a simple method to distill large vision foundation models into smaller, accessible models trained on ImageNet-1K, enabling broader research without access to original training data.
Contribution
Proteus provides a novel, data-efficient distillation approach that removes dataset bias and achieves competitive performance with significantly less training data.
Findings
Proteus-L/14 matches DINOv2-L/14 performance across 19 benchmarks.
Proteus outperforms larger models like CLIP-L/14 and SynCLR-L/14 with fewer training images.
The method enables training foundation models at ImageNet-level costs.
Abstract
Vision foundation models are renowned for their generalization ability, which stems from massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible (e.g., for CLIP and DINOv2), posing great challenges to developing derivatives that could facilitate research. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for…
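The three objective levels named in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the loss form, so everything below is an assumption made for illustration: a mean-squared-error loss, one linear projection per level to map the student width to the teacher width, the class token at index 0, and an unweighted sum of the three terms. The function name `proteus_style_loss` and its parameters are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def mse(a, b):
    """Mean-squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def proteus_style_loss(student_tokens, teacher_tokens,
                       proj_token, proj_patch, proj_feature):
    """Hypothetical three-level distillation loss (illustrative only).

    student_tokens: (batch, 1 + patches, d_student), class token at index 0
    teacher_tokens: (batch, 1 + patches, d_teacher), class token at index 0
    proj_*:         (d_student, d_teacher) linear maps, one per level (assumed)
    """
    # Token level (assumed): align the class token only.
    token_loss = mse(student_tokens[:, 0] @ proj_token, teacher_tokens[:, 0])
    # Patch level (assumed): align the patch tokens only.
    patch_loss = mse(student_tokens[:, 1:] @ proj_patch, teacher_tokens[:, 1:])
    # Feature level (assumed): align the full token sequence.
    feature_loss = mse(student_tokens @ proj_feature, teacher_tokens)
    # Unweighted sum is an assumption; no class labels or logits are used.
    return token_loss + patch_loss + feature_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(2, 5, 16))   # stand-in for frozen teacher outputs
    student = rng.normal(size=(2, 5, 8))    # stand-in for student outputs
    w = rng.normal(size=(8, 16))            # stand-in projection weights
    print(proteus_style_loss(student, teacher, w, w, w))
```

Because every term compares intermediate representations rather than class logits, the sketch needs no ImageNet labels, which is consistent with the abstract's stated aim of avoiding designs that introduce dataset bias.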
Peer Reviews
Decision: ICLR 2025 Poster
Presentation: The paper was written well and was easy to follow. Experiments: Authors evaluated their models on a wide range of CV tasks (classification, segmentation, depth estimation) using several downstream datasets. Experimental results show that models distilled only on ImageNet using the proposed approach match/outperform same size models trained on significantly larger datasets without distillation. Experiments also show that the feature-based distillation performs better than logit-b
"The problem setting of the paper is not convincing." While the original datasets used to train models such as DINOv2 are unavailable, there are numerous large-scale image datasets (DataComp, LAION, etc.) that are available to the community these days. If the goal is to get a small model that performs well on a broad array of tasks, I do not understand why one should restrict themselves to distilling on ImageNet. They can use more diverse datasets, for example, diverse subsets of data taken fr
1. This paper is well-motivated. With the development of foundation models trained on private data, it is difficult for the community to reproduce or compress these models because the training data is inaccessible. Thus, how to use existing public academic datasets as a proxy to achieve comparable performance is an interesting question. 2. The performance of the model is excellent, as it achieves performance comparable to the foundation model with much less training data. Besides, this pa
1. The novelty of this paper is limited. Although the results are promising, losses at the token, feature, and patch levels have been widely explored in vision transformer distillation [1,2,3,4]. Besides, all of them can be considered KD with intermediate features, which has been a common setting in KD. However, this paper does not cite or discuss any recent related works that perform KD with intermediate feature maps. 2. Although this paper presents experimental results to support that empiri
1. This paper presents a simple yet effective method that utilizes the hidden layer features of the teacher network to successfully transfer its generalization ability to the student network. 2. The article provides extensive experimental validation of its claims on the ViT architecture, with datasets that include classification, segmentation, and depth estimation. The student network architecture encompasses various sizes, including ViT-small, ViT-base and ViT-large. 3. This paper is well-wri
1. This paper lacks novelty. Previous works have also utilized intermediate layers for distillation on ImageNet-1K and achieved good performance on other datasets, such as MiniViT [1] and [2]. What are the differences between this work and those approaches? What are the comparative results of these methods with your work? 2. Figure 3 is not referenced in the main text, and it is unclear what issue Figure 3 is intended to illustrate. 3. When validating the generalization ability of the studen
Taxonomy
Topics: Urban Planning and Valuation
Methods: Contrastive Language-Image Pre-training · Knowledge Distillation
