Attribute-based Visual Reprogramming for Vision-Language Models
Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

TL;DR
This paper introduces Attribute-based Visual Reprogramming (AttrVR) for CLIP, leveraging descriptive and distinctive attributes to improve input reprogramming, leading to better downstream task performance and more dynamic, sample-specific optimization.
Contribution
AttrVR is a novel approach that incorporates attribute-guided textual information and iterative refinement to enhance visual reprogramming for vision-language models like CLIP.
Findings
Outperforms existing visual reprogramming methods across 12 downstream tasks.
Reduces intra-class variance and increases inter-class separation.
Effective for both ViT-based and ResNet-based CLIP.
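The second finding can be made concrete with a small measurement sketch. The definitions below (mean squared distance to the class centroid, and mean pairwise distance between centroids) are common choices assumed here for illustration; the paper's exact definitions are not given in this summary, and the function names and toy features are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def intra_class_variance(features_by_class):
    """Mean squared Euclidean distance of each sample to its class centroid."""
    total, count = 0.0, 0
    for points in features_by_class.values():
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
            count += 1
    return total / count


def inter_class_separation(features_by_class):
    """Mean Euclidean distance between every pair of class centroids."""
    cents = [centroid(p) for p in features_by_class.values()]
    dists = [math.dist(a, b) for a, b in combinations(cents, 2)]
    return sum(dists) / len(dists)


# Toy 2-D features (made up for illustration, not CLIP embeddings).
toy = {"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[10.0, 0.0], [12.0, 0.0]]}
intra = intra_class_variance(toy)
inter = inter_class_separation(toy)
```

Under these assumed definitions, a method that tightens clusters lowers `intra`, and one that pushes classes apart raises `inter`.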
Abstract
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may…
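The two ideas in the abstract, a trainable pattern added to the input and attribute-based text guidance, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the padding-style pattern is one common VR variant, and the top-`k` attribute selection and the weight `lam` are hypothetical choices that do not come from the text above. Embeddings are toy vectors rather than CLIP outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def reprogram(image, delta, pad):
    """Padding-style VR (assumed variant): keep the image in the centre of a
    larger canvas and fill the border with the trainable pattern `delta`,
    which must have shape (h + 2*pad) x (w + 2*pad)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in delta]          # start from the pattern
    for i in range(h):
        for j in range(w):
            out[i + pad][j + pad] = image[i][j]  # centre holds the image
    return out


def class_score(img_emb, des_embs, dist_embs, k=2, lam=0.5):
    """Score one class from its attribute embeddings (hypothetical scheme):
    average the k most similar DesAttrs and DistAttrs, then mix with `lam`."""
    def topk_mean(embs):
        sims = sorted((cosine(img_emb, e) for e in embs), reverse=True)[:k]
        return sum(sims) / len(sims)
    return lam * topk_mean(des_embs) + (1 - lam) * topk_mean(dist_embs)


# Toy usage: a 2x2 "image" padded by 1 pixel with a constant pattern.
image = [[1.0, 2.0], [3.0, 4.0]]
delta = [[0.5] * 4 for _ in range(4)]
x = reprogram(image, delta, pad=1)

img_emb = [1.0, 0.0]
score_match = class_score(img_emb, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]], k=1)
score_other = class_score(img_emb, [[0.0, 1.0]], [[0.0, 1.0]], k=1)
```

In a real setup, `delta` would be optimized by gradient descent while the CLIP encoders stay frozen, and the predicted class would be the one with the highest score.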
Peer Reviews
Decision·ICLR 2025 Poster
This paper addresses the limitations of Visual Reprogramming (VR) in vision-language models like CLIP by introducing Attribute-based Visual Reprogramming (AttrVR). Unlike traditional VR methods that rely on fixed text templates and category labels, AttrVR leverages Descriptive Attributes (DesAttrs) and Distinctive Attributes (DistAttrs) generated by large language models (e.g., GPT-3.5) to capture both common and unique features of each category. During training, AttrVR dynamically selects relev
1. **Dependence on Large Language Models:** The AttrVR method relies on large language models (e.g., GPT-3.5) to generate descriptive and distinctive attributes. This dependence may increase computational costs and limit the method's applicability in resource-constrained environments.
2. **Reliability of Attribute Generation:** The attribute descriptions generated by large language models may have issues with accuracy and relevance. If the generated attributes do not match the target categories
- The paper presents a neat incorporation of the vast knowledge of LLMs to improve downstream classification of vision-language models. The use of descriptive and distinctive attributes of classes to leverage the language capability in vision-language joint pre-training is well validated through experimental results and theoretical backing.
- The proposed method achieves excellent improvements over existing visual reprogramming methods across a diverse set of visual classification datasets and
- Is there a way to ascertain that the distinctive captions generated are actually distinct in the embedding space? While lines 78-79 claim this is the case in Figure 1, I disagree. The distinctive attributes do not necessarily seem to be farther from the cluster center than the descriptive attributes. The authors could compare the cosine distance of descriptive attributes and distinctive attributes with the class labels. Ideally, there should be a higher distance from other classes for distinct
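The check this reviewer proposes can be sketched as follows: for one class, measure the mean cosine distance from its attribute embeddings to the label embeddings of the other classes, once for descriptive and once for distinctive attributes. The helper name and the toy vectors are hypothetical; in practice the embeddings would come from CLIP's text encoder.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_distance_to_other_classes(attr_embs, label_embs, own_class):
    """Average distance from each attribute embedding of `own_class`
    to the label embedding of every other class."""
    others = [e for c, e in label_embs.items() if c != own_class]
    dists = [cosine_distance(a, o) for a in attr_embs for o in others]
    return sum(dists) / len(dists)


# Toy embeddings (made up for illustration, not real CLIP outputs).
label_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
des_cat = [[1.0, 1.0]]    # attribute shared with "dog", e.g. "has fur"
dist_cat = [[1.0, 0.0]]   # attribute specific to "cat"

d_des = mean_distance_to_other_classes(des_cat, label_embs, "cat")
d_dist = mean_distance_to_other_classes(dist_cat, label_embs, "cat")
```

If distinctive attributes behave as intended, `d_dist` should exceed `d_des` for most classes; a comparison over real embeddings would show whether that holds.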
1. The paper is well written and provides theoretical insights showing how AttrVR reduces intra-class variance and increases inter-class separation, enhancing the model’s discriminative power.
2. The motivation for this submission is easy to understand: it leverages descriptive and distinctive attributes instead of traditional label prompts, maximizing CLIP’s multimodal capabilities to improve classification accuracy.
3. AttrVR is tested across multiple datasets, with results indicating su
In general, the motivation for this submission is easy to understand and the insight is interesting. However, there are still several weaknesses, as follows:
1. While the paper mentions using GPT-3.5 for generating descriptive and distinctive attributes (DesAttrs and DistAttrs), it lacks detailed reasoning on why GPT-3.5 was specifically chosen over other potential models. Further, the method relies heavily on the accuracy and quality of DesAttrs and DistAttrs generated by a language model. Howeve
Taxonomy
Topics: Neural Networks and Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Average Pooling · Kaiming Initialization · Global Average Pooling · Max Pooling · Contrastive Language-Image Pre-training · Convolution
