Exploring Weak-to-Strong Generalization for CLIP-based Classification
Jinhao Li, Sarah M. Erfani, Lei Feng, James Bailey, Feng Liu

TL;DR
This paper introduces class prototype learning (CPL), a method that improves CLIP-based classification by leveraging weak-to-strong generalization, especially effective with limited pretraining, achieving notable accuracy gains.
Contribution
The study extends weak-to-strong generalization to vision-language models and proposes CPL to enhance CLIP's classification through learned prototypes.
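As a rough illustration of what learning class prototypes from a weak supervisor could look like, the sketch below fits one prototype vector per class so that a prototype-based classifier matches soft labels produced by a weaker model. It is not the authors' implementation. The dot-product similarity, the soft-label cross-entropy loss, the toy 2-D features, and every function name (`predict`, `distill_loss`, `train_step`) are assumptions made for illustration. The small initial prototypes are a stand-in for the text-encoder initialization that one review mentions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(feature, prototypes, temperature=1.0):
    """Class probabilities from dot-product similarity to each prototype."""
    logits = [sum(f * p for f, p in zip(feature, proto)) / temperature
              for proto in prototypes]
    return softmax(logits)

def distill_loss(features, weak_probs, prototypes, temperature=1.0):
    """Mean cross-entropy between weak soft labels and prototype predictions."""
    total = 0.0
    for x, q in zip(features, weak_probs):
        p = predict(x, prototypes, temperature)
        total -= sum(qk * math.log(pk) for qk, pk in zip(q, p))
    return total / len(features)

def train_step(features, weak_probs, prototypes, lr=0.5, temperature=1.0):
    """One gradient-descent step; returns an updated copy of the prototypes."""
    grads = [[0.0] * len(proto) for proto in prototypes]
    for x, q in zip(features, weak_probs):
        p = predict(x, prototypes, temperature)
        for k in range(len(prototypes)):
            # d(loss)/d(logit_k) = p_k - q_k for softmax cross-entropy
            coeff = (p[k] - q[k]) / temperature
            for d, xd in enumerate(x):
                grads[k][d] += coeff * xd / len(features)
    return [[w - lr * g for w, g in zip(proto, grad)]
            for proto, grad in zip(prototypes, grads)]

# Toy data (assumed): 2-D "image features" from the strong model and
# soft labels from a hypothetical weak model. No ground-truth labels are used.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
weak_probs = [[0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.3, 0.7]]

# Stand-in for prototypes initialized from text embeddings (assumption).
prototypes = [[0.1, 0.0], [0.0, 0.1]]

initial_loss = distill_loss(features, weak_probs, prototypes)
for _ in range(50):
    prototypes = train_step(features, weak_probs, prototypes)
final_loss = distill_loss(features, weak_probs, prototypes)
```

After training, `predict(feature, prototypes)` needs only the image feature and the learned prototypes. That is consistent with the reviewers' remark that the text encoder can be dropped at inference time.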
Findings
CPL improves classification accuracy by 3.67% over baselines.
Especially effective in limited-pretraining scenarios.
Demonstrates robustness of weak supervision in multi-modal models.
Abstract
Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient. A novel solution proposed recently is using a weaker model to supervise a stronger model. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors. Previous work has shown the effectiveness of weak-to-strong generalization in the context of language-only models. Extending this concept to vision-language models leverages these insights, adapting the proven benefits to a multi-modal context. In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, class prototype learning (CPL),…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The idea of weak-to-strong generalization is somewhat meaningful.
- The efficiency is improved by removing the text encoder in the inference stage.
- I am quite confused by the 'unlabeled data' setting. Although CPL does not use labeled data, the weak model is trained with ground-truth labels. Maybe you just want to verify the idea of 'weak-to-strong'. However, in practice, you will not have a weak model that happens to cover exactly the k candidate categories (particularly when the weak model is a vision-only model), so it is infeasible to optimize the CPL loss to teach the k prototypes.
- The ideas of using the text encoder for initialization and dro
1. This paper is the first to explore weak-to-strong generalization in CLIP-based classification.
2. This paper proposes a straightforward yet effective method for achieving good performance.
1. The aim of weak-to-strong generalization is to mitigate harmful outputs while enhancing model performance. However, in your experimental results, while an improvement in the strong model's performance is evident, the aspect of protection against harmful outputs is not sufficiently demonstrated.
2. You have only validated your method on the DomainNet dataset. We recommend testing the effectiveness of your approach on additional datasets.
1. As claimed by the authors, their CPL method achieves SOTA with many kinds of weak models (including ResNet and CvT) when tested on six distinct domains of the DomainNet dataset.
2. Unlike traditional knowledge distillation, weak-to-strong generalization uses a weaker model as the teacher, and this work is the first to introduce this new knowledge distillation method to VLMs. In fact, the research is quite interesting and meaningful.
1. More related experiments are needed to increase confidence in your method, as comparisons of experimental results with counterparts on many important datasets are missing.
2. There are some imprecise statements in the paper, and the CPL method lacks a visual demonstration of results. It is necessary to add the relevant content in the appendix. Please see the questions section for more details.
- This paper explores weak-to-strong generalization: how to train models when models surpass human knowledge. Unlike previous works that consider LLMs, it works on CLIP, a VLM.
- Experiments show that the proposed method improves over other CLIP tuning methods.
- The paper starts with an ambitious story (weak-to-strong generalization), but it remains unclear how the setting and solution can benefit human-surpassing models. The scope of the paper narrows down to a specific application in CLIP classification, which is disconnected from the overarching goals of weak-to-strong generalization. If the major novelty is considering a VLM, why not consider LLaVA, BLIP2, or similar models?
- The setting and proposed method appear to be somewhat ad-hoc focused on imp
The proposed method is quite simple and easy-to-follow.
The reviewer may lack familiarity with recent work on *weak-to-strong generalization* and will therefore await additional perspectives from other reviewers to assess the technical novelty of this paper. Based on my expertise in representation learning and knowledge distillation, I have the following concerns and questions:
1. **Methodology**: The approach proposed in this paper is somewhat straightforward, essentially using the original KD (Knowledge Distillation) loss. This may detract from
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques · Multimodal Machine Learning Applications
