Attribute-based Visual Reprogramming for Vision-Language Models

Chengyi Cai; Zesheng Ye; Lei Feng; Jianzhong Qi; Feng Liu

arXiv:2501.13982·cs.CV·February 26, 2025

TL;DR

This paper introduces Attribute-based Visual Reprogramming (AttrVR) for CLIP, leveraging descriptive and distinctive attributes to improve input reprogramming, leading to better downstream task performance and more dynamic, sample-specific optimization.

Contribution

AttrVR is a novel approach that incorporates attribute-guided textual information and iterative refinement to enhance visual reprogramming for vision-language models like CLIP.

Findings

01

Outperforms existing visual reprogramming methods on 12 downstream tasks.

02

Reduces intra-class variance and increases inter-class separation.

03

Effective for both ViT-based and ResNet-based CLIP.

Abstract

Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may…
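The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes toy 2-D vectors in place of real CLIP image and text embeddings, and the aggregation rule (mean similarity of the `k` nearest attributes, mixed by a hypothetical `weight`) is an assumption made for illustration, not the paper's confirmed formulation. `add_pattern` shows the additive VR step on a tiny nested-list "image"; `classify` scores each class from its DesAttrs and DistAttrs rather than from a single label prompt.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def add_pattern(image, pattern):
    """Additive VR: add a trainable noise pattern to the input, pixel by pixel."""
    return [[p + d for p, d in zip(img_row, pat_row)]
            for img_row, pat_row in zip(image, pattern)]


def attribute_score(image_emb, attr_embs, k):
    """Mean similarity of the k attributes nearest to the image (assumed rule)."""
    sims = sorted((cosine(image_emb, a) for a in attr_embs), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def classify(image_emb, des_attrs, dist_attrs, k=2, weight=0.5):
    """Score each class from its DesAttrs and DistAttrs; `weight` is hypothetical."""
    scores = {
        cls: weight * attribute_score(image_emb, des_attrs[cls], k)
        + (1.0 - weight) * attribute_score(image_emb, dist_attrs[cls], k)
        for cls in des_attrs
    }
    return max(scores, key=scores.get), scores


# Toy 2-D embeddings standing in for CLIP text and image features.
DES_ATTRS = {"cat": [[1.0, 0.0], [0.9, 0.2]], "dog": [[0.0, 1.0], [0.2, 0.9]]}
DIST_ATTRS = {"cat": [[1.0, 0.1]], "dog": [[0.1, 1.0]]}

reprogrammed = add_pattern([[0.0, 1.0]], [[0.5, -0.5]])
predicted, class_scores = classify([1.0, 0.1], DES_ATTRS, DIST_ATTRS)
```

In a real setting the pattern would be optimized by gradient descent through a frozen CLIP image encoder, and the attribute embeddings would come from the CLIP text encoder; both are omitted here to keep the sketch self-contained.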

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper addresses the limitations of Visual Reprogramming (VR) in vision-language models like CLIP by introducing Attribute-based Visual Reprogramming (AttrVR). Unlike traditional VR methods that rely on fixed text templates and category labels, AttrVR leverages Descriptive Attributes (DesAttrs) and Distinctive Attributes (DistAttrs) generated by large language models (e.g., GPT-3.5) to capture both common and unique features of each category. During training, AttrVR dynamically selects relev

Weaknesses

1. **Dependence on Large Language Models:** The AttrVR method relies on large language models (e.g., GPT-3.5) to generate descriptive and distinctive attributes. This dependence may increase computational costs and limit the method's applicability in resource-constrained environments.
2. **Reliability of Attribute Generation:** The attribute descriptions generated by large language models may have issues with accuracy and relevance. If the generated attributes do not match the target categories

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper presents a neat incorporation of the vast knowledge of LLMs to improve downstream classification of vision language models. The use of descriptive and distinctive attributes of classes to leverage the language capability in vision-language joint pre-training is well validated through experimental results and theoretical backing.
- The proposed method achieves excellent improvements above existing visual reprogramming methods across a diverse set of visual classification datasets and

Weaknesses

- Is there a way to ascertain that the distinctive captions generated are actually distinct in the embedding space? While lines 78-79 claim this is the case in Figure 1, I disagree. The distinctive attributes do not necessarily seem to be farther from the cluster center than the descriptive attributes. The authors could compare the cosine distance of descriptive attributes and distinctive attributes with the class labels. Ideally, there should be a higher distance from other classes for distinct
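The check the reviewer suggests could be sketched as follows. This is only an illustration under stated assumptions: the 2-D vectors are toy stand-ins for CLIP text embeddings, the helper name `mean_distance_to_other_labels` is hypothetical, and the numbers say nothing about the paper's actual attributes. The idea is to average the cosine distance from one class's attributes to the label embeddings of every other class, then compare DesAttrs against DistAttrs.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_distance_to_other_labels(attr_embs, label_embs, own_class):
    """Average distance from a class's attributes to all other class labels."""
    dists = [cosine_distance(attr, emb)
             for attr in attr_embs
             for cls, emb in label_embs.items() if cls != own_class]
    return sum(dists) / len(dists)


# Toy label and attribute embeddings (assumed values, for illustration only).
LABELS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
DES_CAT = [[0.8, 0.6]]   # descriptive attribute: shares features with "dog"
DIST_CAT = [[1.0, 0.1]]  # distinctive attribute: points away from "dog"

des_gap = mean_distance_to_other_labels(DES_CAT, LABELS, "cat")
dist_gap = mean_distance_to_other_labels(DIST_CAT, LABELS, "cat")
```

If distinctive attributes behave as intended, `dist_gap` should exceed `des_gap` for most classes; with real embeddings the comparison would be run per class and summarized across the dataset.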

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper is well written and provides theoretical insights showing how AttrVR reduces intra-class variance and increases inter-class separation, enhancing the model’s discriminative power.
2. The motivation for this submission is easy to understand: it leverages descriptive and distinctive attributes instead of traditional label prompts, maximizing CLIP’s multimodal capabilities to improve classification accuracy.
3. AttrVR is tested across multiple datasets, with results indicating su

Weaknesses

In general, the motivation for this submission is easy to understand and the insight is interesting. However, there are still several weaknesses, as follows:
1. While the paper mentions using GPT-3.5 for generating descriptive and distinctive attributes (DesAttrs and DistAttrs), it lacks detailed reasoning on why GPT-3.5 was specifically chosen over other potential models. Further, the method relies heavily on the accuracy and quality of DesAttrs and DistAttrs generated by a language model. Howeve

Code & Models

Repositories

tmlr-group/attrvr · PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Average Pooling · Kaiming Initialization · Global Average Pooling · Max Pooling · Contrastive Language-Image Pre-training · Convolution