SILC: Improving Vision Language Pretraining with Self-Distillation

Muhammad Ferjad Naeem; Yongqin Xian; Xiaohua Zhai; Lukas Hoyer; Luc Van Gool; Federico Tombari

arXiv:2310.13355 · cs.CV · December 8, 2023 · 1 cites

TL;DR

SILC enhances vision-language pretraining by adding local-to-global self-distillation to the contrastive objective, significantly improving performance on dense prediction tasks and open-vocabulary capabilities over existing contrastive models such as CLIP.

Contribution

The paper introduces SILC, a novel self-distillation framework that improves dense prediction and open vocabulary tasks in vision-language pretraining.

Findings

1. Sets new state-of-the-art in zero-shot and few-shot classification
2. Improves dense prediction tasks like detection and segmentation
3. Enhances open vocabulary detection, captioning, and VQA

Abstract

Image-text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and…
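The local-to-global self-distillation described in the abstract can be sketched in a few lines. The following is a minimal, dependency-free illustration and not the authors' implementation: the function names, the temperatures (0.04 and 0.1), the EMA momentum (0.996), and the loss weight are all assumptions chosen for illustration, and details such as teacher output centering, batching, and the contrastive term itself are omitted.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, stabilised by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed without taking log(0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = math.log(sum(math.exp(s - m) for s in scaled))
    return [s - m - log_norm for s in scaled]


def distillation_loss(teacher_global_logits, student_local_logits,
                      teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between the teacher's distribution on a global view
    and the student's distribution on a local crop of the same image.
    Temperatures are hypothetical values, not taken from the paper."""
    p_teacher = softmax(teacher_global_logits, teacher_temp)
    log_p_student = log_softmax(student_local_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter towards the student parameter.
    The teacher receives no gradients; it only tracks the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def total_loss(contrastive_loss, distill_loss, distill_weight=1.0):
    """Image-text contrastive loss plus the weighted distillation term.
    The weighting scheme is an assumption for this sketch."""
    return contrastive_loss + distill_weight * distill_loss
```

In this sketch the student is trained on `total_loss`, while the teacher is refreshed with `ema_update` after each student step, so the distillation target changes slowly and stays consistent across local crops.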

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The method is simple and effective. Although it does not introduce novel objectives for pre-training, it shows a simple combination of CLIP and DINO can work well at a relatively large scale.
2. The presentation is clear and easy to follow.

Weaknesses

1. The experiments are not convincing. In Table 1, the authors reproduce several baseline methods for comparison under a comparable setting. However, in Table 2, only the reproduced CLIP has been compared. In Table 3, only open-sourced CLIP has been compared. The adaptation costs of those reproduced methods are relatively small compared to the pre-training costs. I think the authors should also compare with those baseline methods in Table 2 and 3.
2. For large-scale pre-training, computation eff

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

1. The method in this paper is simple and has been shown to be effective on several downstream tasks including image classification, retrieval, and segmentation.
2. The authors have provided ablation studies on the components of the method and verified the effectiveness of the self-distillation across many downstream tasks.
3. The authors demonstrate better scalability of the method over baseline image-text pre-training on zero-shot classification on the ImageNet dataset.

Weaknesses

1. The motivation is somewhat unaligned with the experiments. The authors discussed the challenge of open-vocabulary dense prediction tasks including image segmentation and object detection. Conceptually, the proposed method would be established as a solution to the mainstream open-vocabulary dense prediction tasks including semantic segmentation, instance segmentation, and object detection, by imposing the global-local consistency that has been proven effective for these tasks in the literature

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. SILC applies local-to-global correspondence learning by self-distillation as an additional objective for contrastive pre-training.
2. Distilling local image features from an EMA teacher model improves model performance on various computer vision tasks.
3. SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation.

Weaknesses

1. The novelty of the paper is somewhat concerning, as local-global consistency training has already been widely used in a number of works (e.g. Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2022b). EMA is also a very common technique widely used in the self-/semi-supervised learning community. The limitation is also pointed out by the paper.
2. SILC claims to improve vision-language pretraining models for “dense prediction tasks”. However, only segmentation related tasks are validat

Reviewer 04 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

This work focuses on improving the performance of vision-language models on dense prediction tasks by designing a pre-training strategy, which is critical in real-world applications. Experiments on both global recognition and dense prediction tasks are conducted and reported.

Weaknesses

The important technical details, especially about the local crop, should be further clarified and explained.
- The operation details of how to obtain the local crop are important and expected to be clarified, since only applying additional views has introduced significant performance improvement as shown in Table 4.
- The explanation of why the performance is improved by enforcing the embeddings of local crop to be similar to the teacher prediction. As the local view is “a random crop over a

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Learning · Contrastive Language-Image Pre-training