AutoVP: An Automated Visual Prompting Framework and Benchmark
Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, Tsung-Yi Ho

TL;DR
AutoVP is an automated framework and benchmark for visual prompting, improving the design and evaluation of prompt-based fine-tuning of vision models across multiple tasks, leading to significant performance gains.
Contribution
The paper introduces AutoVP, an end-to-end automated framework for visual prompting (VP) design, together with a comprehensive benchmark of 12 downstream image-classification tasks, advancing VP research and application.
Findings
AutoVP outperforms existing VP methods by up to 6.7% accuracy.
AutoVP achieves a maximum performance increase of 27.5% over linear probing.
The framework effectively automates hyperparameter tuning for VP.
Abstract
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin,…
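To make the design space in the abstract concrete, the following is a minimal NumPy sketch of two of its ingredients: a frame-style visual prompt added around a resized input, and one simple nonparametric label mapping. It is an illustration under stated assumptions, not the authors' implementation. The function names, the nearest-neighbor resize, the greedy frequency-based assignment, and the 224-pixel canvas are all hypothetical choices made for this example.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbor resize of an (H, W, C) image to (size, size, C)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def apply_frame_prompt(img, delta, inner_size):
    """Center the resized image on the canvas and add the prompt `delta`
    only on the surrounding frame (assumed prompt layout)."""
    canvas = delta.shape[0]
    pad = (canvas - inner_size) // 2
    mask = np.ones_like(delta)
    mask[pad:pad + inner_size, pad:pad + inner_size] = 0.0  # no prompt over the image
    out = np.zeros_like(delta)
    out[pad:pad + inner_size, pad:pad + inner_size] = resize_nearest(img, inner_size)
    return np.clip(out + mask * delta, 0.0, 1.0)

def frequency_label_mapping(source_preds, target_labels, n_source, n_target):
    """Nonparametric mapping (assumed variant): greedily assign each target
    class the unused source class most often predicted for its samples.
    Requires n_source >= n_target."""
    counts = np.zeros((n_target, n_source), dtype=int)
    np.add.at(counts, (target_labels, source_preds), 1)
    work = counts.astype(float)
    mapping = {}
    for _ in range(n_target):
        t, s = np.unravel_index(np.argmax(work), work.shape)
        mapping[int(t)] = int(s)
        work[t, :] = -1  # target class t is now assigned
        work[:, s] = -1  # source class s is now taken
    return mapping

# Usage: a 32x32 image resized to 128x128 inside a 224x224 prompted canvas.
img = np.ones((32, 32, 3))
delta = np.full((224, 224, 3), 0.5)  # would be trained in practice
prompted = apply_frame_prompt(img, delta, inner_size=128)

# Usage: map 2 target classes onto 10 source classes from observed predictions.
source_preds = np.array([5, 5, 2, 7, 7, 7])
target_labels = np.array([0, 0, 0, 1, 1, 1])
mapping = frequency_label_mapping(source_preds, target_labels, 10, 2)
```

In a full pipeline, `delta` and the inner image size would be optimized jointly with the chosen pre-trained model held frozen, and a trainable mapping could replace the counting step. The sketch only shows how the pieces fit together.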
Peer Reviews
Decision: ICLR 2024 poster
**Clarity and Logic**: The paper is well-structured and presents complex ideas clearly, making it understandable for readers. **Useful Framework**: AutoVP is introduced as a versatile toolbox that simplifies the development of visual prompts, offering a modular design and comprehensive functionalities. **Improved Performance**: The models tuned with AutoVP demonstrate a significant performance improvement over previous baselines across various image classification tasks.
**Limited Novelty**: The framework largely combines existing methods, which might suggest a wrap-up of previous work rather than introducing new concepts, limiting the perceived novelty of the research. **Potential Overfitting**: AutoVP uses different settings for different datasets, raising the question of whether these are overfitted to specific tasks and what the implications are for a robust, universal setting. **Insufficient Analysis of Mapping Methods**: There is a lack of detailed compa
1. This paper presents its findings with clear figures and detailed statistical reports, making it easier for readers to grasp the results. 2. This paper does not just present a tool; it also explores optimal configurations under various conditions in detail, aiming to show how different settings affect performance. It also examines the impact of domain similarity on VP performance.
1. While VP can potentially be used for a variety of vision tasks, the paper seems to focus primarily on image classification tasks, which may limit its applicability to broader vision problems. Are there any additional results on dense discriminant tasks? 2. When utilizing CLIP as the pre-trained classifier within the framework, which visual backbone is employed, ViT or ResNet? 3. About the proposed VP benchmark, why do the authors exclude some widely recognized 2D datasets, such as Caltech101,
1. An extensive study of visual prompting on vision models such as ResNeXt, ViT, and CLIP. 2. By applying a series of established approaches, from input scaling to output label engineering, AutoVP achieves large gains in results.
1. The paper is evaluated on 12 visual recognition tasks. Given that this is a benchmark paper, what about other tasks, such as object detection, depth estimation, and segmentation? 2. The reviewer appreciates this systematic study, which applies all VP methods and improves results. However, those results are expected. Having learned only a little from the paper, the reviewer would generally expect more surprising findings or more impressive insights.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning
