Language Models as Black-Box Optimizers for Vision-Language Models
Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, and Deva Ramanan

TL;DR
This paper introduces a black-box optimization method using chat-based LLMs to find effective prompts for vision-language models, outperforming existing methods without requiring access to model internals.
Contribution
The paper presents a novel black-box prompt optimization approach employing LLMs and an automatic hill-climbing procedure, enabling effective VLM tuning without model access.
Findings
Outperforms white-box prompt tuning methods like CoOp by 1.5% on average across 11 datasets.
Generated prompts are more interpretable and transfer well across different VLM architectures.
Successfully applied to optimize DALL-E 3 for various tasks including text-to-image generation.
Abstract
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our…
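The hill-climbing loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `hill_climb_prompts`, `toy_score`, and `make_toy_proposer` are hypothetical names, the scorer stands in for a real black-box VLM evaluation (for example, few-shot accuracy), and the proposer stands in for a chat-based LLM that receives scored prompts as textual feedback. The form of the feedback (the top-ranked prompts with their scores) is also an assumption.

```python
import random
from typing import Callable, Dict, List, Tuple

Feedback = List[Tuple[str, float]]  # (prompt, score) pairs, best first


def hill_climb_prompts(
    score: Callable[[str], float],
    propose: Callable[[Feedback], List[str]],
    initial_prompts: List[str],
    n_iters: int = 10,
    keep_top: int = 3,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair found by the search."""
    # Only the scalar score of each prompt is observed: no weights,
    # embeddings, or logits of the underlying model are needed.
    pool: Dict[str, float] = {p: score(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)
        # Show the proposer the current best prompts and their scores.
        for candidate in propose(ranked[:keep_top]):
            if candidate not in pool:
                pool[candidate] = score(candidate)
    return max(pool.items(), key=lambda kv: kv[1])


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for VLM accuracy: fraction of target words present."""
    target = {"a", "photo", "of", "centered"}
    return len(target & set(prompt.split())) / len(target)


def make_toy_proposer(seed: int = 0) -> Callable[[Feedback], List[str]]:
    """Hypothetical stand-in for an LLM: extends the best prompt with random words."""
    rng = random.Random(seed)
    vocab = ["a", "photo", "of", "centered", "blurry", "sketch"]

    def propose(feedback: Feedback) -> List[str]:
        best_prompt = feedback[0][0]
        return [best_prompt + " " + rng.choice(vocab) for _ in range(4)]

    return propose


if __name__ == "__main__":
    best, best_score = hill_climb_prompts(
        toy_score, make_toy_proposer(0), ["a"], n_iters=5
    )
    print(best, best_score)
```

Because every evaluated prompt stays in the pool, the returned score can never fall below that of the best initial prompt. In a real setting, `score` would query the VLM on the few-shot training set and `propose` would wrap a chat API call.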
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper is very well written and polished. The method is very intuitive and easy to implement. The results presented show some gains over CoOp.
The method is very simple and the "our approach" part is a single paragraph. The idea is not very surprising, as it basically asks an LLM to do what a person would do manually, although it is interesting to see that it can surpass CoOp. Comparisons are limited to the one-shot scenario. It is common in the prompting literature to provide results with 1, 2, 4, 8 and 16 shots (or at least the latter I believe?). This might look a bit like cherry-picking for maximal gains. I understand that c
1. The paper utilizes ChatGPT to mitigate the manual annotation efforts. 2. The paper provides the estimated cost of the OpenAI APIs.
1. The paper didn't provide an experimental comparison to other black-box approaches, such as the methods mentioned in the related work (heuristic-based editing, continuous prefix-tuning, discrete token searching). 2. The approach only works better than CoOp in the one-shot setting and does not scale well with the number of shots (Table 6), limiting the usefulness of the method. -------- There are several typos, e.g., "optimizier" on page 3 (maybe in other places too).
- The proposed method shows improvement over the gradient-based prompt engineering method CoOp in the one-shot setting, which seems to suggest the proposed method is less prone to overfitting than gradient-based methods. - The proposed method can have applications in extremely low-shot settings and/or settings where the scoring function is a black box.
- My major concern is the applicability of this method. It is probably suitable for extremely low-shot settings, but once the number of training samples increases slightly, its advantage disappears quickly (e.g., with more than 4 shots). Moreover, most vision-language models are not black boxes and thus allow for gradient-based optimization. For this proposed method to work, it would be difficult to find applicable real scenarios where the vision-language model is a black box and there
1. Numerous existing vision-language models (VLMs) are not releasing their weights due to privacy and legal concerns, so it is difficult to employ current parameter-efficient tuning (PET) methods to adapt them to downstream datasets. Therefore, this paper proposes employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification. 2. Technically, they adopt an automatic "hill-climbing" procedure (seems like an RL process
1. The novelty of this paper is limited, as it shares similarities with APE[1] in utilizing LLMs to explore more efficient prompt templates, although the authors claim that their target is to optimize the prompt template for VLMs and that they adopt a similar hill-climbing strategy for prompt optimization. This is more like an engineering improvement for optimizing prompt templates, not enough for a top-tier conference such as ICLR, which needs more explainable insights. 2. As illustrated in this paper, in the high
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Context Optimization · Contrastive Language-Image Pre-training
