HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande, Mohammad Sadegh Norouzzadeh, Devin Willmott, Anna Bair, Madan Ravi Ganesh, J. Zico Kolter

TL;DR
HyperCLIP introduces a resource-efficient vision-language model that uses a hypernetwork to adapt a small image encoder dynamically, enabling effective zero-shot classification with minimal overhead.
Contribution
It proposes HyperCLIP, a novel architecture combining a small image encoder and a hypernetwork trained end-to-end for adaptable, deployment-friendly vision-language tasks.
Findings
Increases zero-shot ImageNet accuracy by up to 3%.
Improves CIFAR-100 accuracy by up to 5%.
Maintains minimal training overhead.
Abstract
Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can…
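The adaptation mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes, following the reviewer comments on this page, that the hypernetwork produces only normalization parameters (a scale and a shift), and it uses random vectors in place of a real text encoder. All names (`ToyHypernetwork`, `ToyImageEncoder`, `zero_shot_predict`) and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8    # shared text/image embedding width (illustrative value)
HIDDEN_DIM = 16  # hidden width of the small image encoder (illustrative value)


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis of x, then apply scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


class ToyHypernetwork:
    """Maps a set of class text embeddings to normalization parameters."""

    def __init__(self):
        # Assumption: a single linear map emits both the scale offset and the shift.
        self.w = rng.normal(scale=0.1, size=(EMBED_DIM, 2 * HIDDEN_DIM))

    def __call__(self, text_embeddings):
        pooled = text_embeddings.mean(axis=0)  # summarize the whole prompt set
        delta_gamma, beta = np.split(pooled @ self.w, 2)
        return 1.0 + delta_gamma, beta  # scale stays close to the identity


class ToyImageEncoder:
    """One hidden layer whose normalization parameters are supplied from outside."""

    def __init__(self, in_dim):
        self.w_in = rng.normal(scale=0.1, size=(in_dim, HIDDEN_DIM))
        self.w_out = rng.normal(scale=0.1, size=(HIDDEN_DIM, EMBED_DIM))

    def __call__(self, images, gamma, beta):
        hidden = layer_norm(images @ self.w_in, gamma, beta)
        return np.maximum(hidden, 0.0) @ self.w_out  # ReLU, then project


def zero_shot_predict(image_embeddings, text_embeddings):
    """Pick, for each image, the class prompt with the highest cosine similarity."""
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=-1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    return (img @ txt.T).argmax(axis=-1)


# Random stand-ins: 5 class-prompt embeddings and 3 flattened toy images.
text_embeddings = rng.normal(size=(5, EMBED_DIM))
images = rng.normal(size=(3, 32))

hypernet = ToyHypernetwork()
encoder = ToyImageEncoder(in_dim=32)

# The hypernetwork runs once per prompt set; the adapted encoder is then reused.
gamma, beta = hypernet(text_embeddings)
predictions = zero_shot_predict(encoder(images, gamma, beta), text_embeddings)
```

In this sketch the hypernetwork is evaluated once for a given set of class prompts, so only the small adapted encoder would need to run per image; whether the real system is organized this way is an assumption here.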
Peer Reviews
Decision · Submitted to ICLR 2025
1. As shown in Table 2, the proposed hyper-network can improve the zero-shot accuracy on ImageNet by up to 3%.
**Unclear Motivation and Unsupported Claims by Experiment Results** The motivation for using a hyper-network to address CLIP's efficiency challenges in edge computing applications is unclear. The introduction does not clearly explain this choice. The experiments show performance improvement over a baseline without a hyper-network but fail to address the efficiency problem. In other words, the hyper-network enhances performance only in small-scale models rather than improving CLIP's efficiency d
1. The paper proposes using a hypernetwork to generate weights for a smaller image encoder within the SigLIP contrastive pre-training framework, allowing for task-specific specialization without extensive retraining.
2. HyperCLIP achieves significant improvements in zero-shot accuracy on ImageNet and CIFAR-100, with minimal overhead, making it suitable for resource-constrained environments.
3. The method is compatible with any type of contrastive pre-training, enhancing its versatility.
4. The p
1. By focusing on adapting only normalization parameters, the paper may not fully leverage the potential of hypernetworks to modify other model parameters.
1. The approach presents an efficient way of using smaller models to deploy vision-language models for resource-constrained real-world applications.
2. The adaptive approach that modifies weights at test time for VLMs/CLIP is novel.
3. The datasets used in experiments align well with prior works.
1. The approach doesn't generalize to broader VLMs, especially the larger models. CLIP models are generally among the smaller of recent VLMs. I don't think the proposed approach generalizes to larger VLMs such as LLaVA [1], which uses a very large decoder-only transformer LLM as a component.
2. CLIP models have a wide range of applications, an important one being the use of the visual features produced by the image encoder as inputs to downstream models. HyperCLIP reduces CLIP applicat
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · HyperNetwork
