SILC: Improving Vision Language Pretraining with Self-Distillation
Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari

TL;DR
SILC enhances vision-language pretraining by incorporating local-to-global self-distillation, significantly improving dense prediction tasks and open vocabulary capabilities over existing contrastive models like CLIP.
Contribution
The paper introduces SILC, a novel self-distillation framework that improves dense prediction and open vocabulary tasks in vision-language pretraining.
Findings
Sets new state-of-the-art in zero-shot and few-shot classification
Improves dense prediction tasks like detection and segmentation
Enhances open vocabulary detection, captioning, and VQA
Abstract
Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and…
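The two ingredients the abstract names, a local-to-global self-distillation loss and an EMA teacher, can be sketched roughly as follows. This is a minimal illustration in plain Python and not the authors' implementation: the function names, the temperatures (0.1 and 0.04), and the momentum (0.99) are assumptions chosen for the example, and details such as teacher centering, batching, and the image-text contrastive term that this loss is added to are left out.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_local_logits, teacher_global_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's distribution on a global view
    and the student's distribution on a local crop of the same image.
    Temperatures are assumed values, not taken from the paper."""
    teacher_probs = [math.exp(v) for v in
                     log_softmax(teacher_global_logits, teacher_temp)]
    student_log_probs = log_softmax(student_local_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter toward the student's by a factor of
    (1 - momentum); the teacher receives no gradient updates.
    The momentum value is an assumption."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage: the loss is small when the student's local-crop prediction
# agrees with the teacher's global-view prediction, and large otherwise.
teacher_logits = [2.0, 0.0, 0.0]
aligned = distillation_loss([2.0, 0.0, 0.0], teacher_logits)
misaligned = distillation_loss([0.0, 0.0, 2.0], teacher_logits)
new_teacher = ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

In a training loop, a term of this kind would be added to the contrastive loss, and `ema_update` would be called after each student optimizer step.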
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The method is simple and effective. Although it does not introduce novel objectives for pre-training, it shows a simple combination of CLIP and DINO can work well at a relatively large scale.
2. The presentation is clear and easy to follow.
1. The experiments are not convincing. In Table 1, the authors reproduce several baseline methods for comparison under a comparable setting. However, in Table 2, only the reproduced CLIP has been compared. In Table 3, only open-sourced CLIP has been compared. The adaptation costs of those reproduced methods are relatively small compared to the pre-training costs. I think the authors should also compare with those baseline methods in Tables 2 and 3.
2. For large-scale pre-training, computation eff
1. The method in this paper is simple and has been shown to be effective on several downstream tasks including image classification, retrieval, and segmentation.
2. The authors have provided ablation studies on the components of the method and verified the effectiveness of the self-distillation across many downstream tasks.
3. The authors demonstrate better scalability of the method over baseline image-text pre-training on zero-shot classification on the ImageNet dataset.
1. The motivation is somewhat unaligned with the experiments. The authors discussed the challenge of open-vocabulary dense prediction tasks including image segmentation and object detection. Conceptually, the proposed method would be established as a solution to the mainstream open-vocabulary dense prediction tasks including semantic segmentation, instance segmentation, and object detection, by imposing the global-local consistency that has been proven effective for these tasks in the literature
1. SILC applies local-to-global correspondence learning by self-distillation as an additional objective for contrastive pre-training.
2. Distilling local image features from an EMA teacher model improves model performance on various computer vision tasks.
3. SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation.
1. The novelty of the paper is somewhat concerning, as local-global consistency training has already been widely used in a number of works (e.g. Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2022b). EMA is also a very common technique widely used in the self-/semi-supervised learning community. The limitation is also pointed out by the paper.
2. SILC claims to improve vision-language pretraining models for “dense prediction tasks”. However, only segmentation related tasks are validat
This work focuses on improving the performance of vision-language models on dense prediction tasks by designing a pre-training strategy, which is critical in real-world applications. Experiments on both global recognition and dense prediction tasks are conducted and reported.
The important technical details, especially about the local crop, should be further clarified and explained.
- How the local crop is obtained is important and should be clarified, since merely adding extra views already yields a significant performance improvement, as shown in Table 4.
- It should be explained why performance improves when the embeddings of the local crop are enforced to be similar to the teacher prediction. As the local view is “a random crop over a
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
