Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
Shuai Fu, Xiequn Wang, Qiushi Huang, Yu Zhang

TL;DR
Nemesis introduces a normalization technique for soft prompts in vision-language models, revealing that adjusting prompt norms can improve model performance and offering new insights into prompt tuning strategies.
Contribution
This work is the first to systematically analyze the impact of soft prompt norms in VLMs and proposes a normalization method called Nemesis to enhance their performance.
Findings
Reducing the norms of certain learned prompts occasionally improves VLM performance.
Increasing prompt norms often degrades model accuracy.
Normalization of soft prompts leads to better downstream task results.
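To make the findings above concrete, here is a minimal sketch of what changing the norm of a soft-prompt vector means. It is an illustration under assumptions, not the authors' code: the list-of-vectors layout, the helper names `l2_norm` and `rescale_position`, and the toy values are all hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a single prompt vector."""
    return math.sqrt(sum(x * x for x in vec))


def rescale_position(prompt, position, factor):
    """Return a copy of `prompt` with the vector at `position` scaled by `factor`.

    `prompt` is a list of vectors, one per prompt position (hypothetical layout).
    A factor below 1 shrinks the norm at that position; above 1 enlarges it.
    """
    corrupted = [list(vec) for vec in prompt]  # copy so the input is untouched
    corrupted[position] = [factor * x for x in corrupted[position]]
    return corrupted


# Toy soft prompt: 4 positions, 3 dimensions each (assumed values for illustration).
prompt = [
    [3.0, 4.0, 0.0],
    [1.0, 2.0, 2.0],
    [0.0, 0.0, 5.0],
    [6.0, 8.0, 0.0],
]

# Halve the norm of the last position only.
shrunk = rescale_position(prompt, position=3, factor=0.5)

norms_before = [l2_norm(v) for v in prompt]
norms_after = [l2_norm(v) for v in shrunk]
```

In an actual experiment, the rescaled prompt would be fed to the frozen VLM and accuracy compared with the unmodified prompt; that evaluation step is omitted here.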
Abstract
With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms on the performance of VLMs. This motivates us to pose an unexplored research question: "Do we need to normalize the soft prompts in VLMs?" To fill this research gap, we first uncover a phenomenon, called the Low-Norm Effect, by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method named Normalizing the soft-prompt vectors of…
Peer Reviews
Decision: ICLR 2024 spotlight
1. The paper is the first study to discuss the influence of soft prompts on VLMs. 2. The paper conducts REPLACE and RESCALE experiments to study the normalization of soft prompts, and proposes Nemesis, which includes two normalization losses, to improve the effectiveness of soft prompts. 3. The paper conducts many experiments to demonstrate the effectiveness of the method.
1. The writing in some parts of the paper is not clear enough, and the authors are advised to check it. For example, there is a discrepancy between Formula 4 and the symbol definitions in the preceding paragraph. 2. The two proposed losses are not connected to any practical interpretation; the authors should discuss why the two forms of normalization affect soft prompts. 3. The paper lacks a discussion of the scenarios in which each of the two normalization losses applies.
1. The paper pioneers a systematic investigation into the role of soft-prompt vector norms in VLMs, addressing a previously unexplored research question. 2. The proposed Nemesis method, with its innovative PEN and PAN losses, offers a potential solution to the Low-Norm Effect, showing promise for improving VLM performance. 3. Extensive corruption experiments shed light on the Low-Norm Effect's impact, providing valuable insights for future soft-prompt tuning endeavors.
1. $\beta$ can be either 0 or 1, corresponding to two variants of the proposed Nemesis method. However, there is no ablation study on the selection of $\beta$, nor is there an exploration of the potential impact of setting $\beta$ with decimal values to assign weights to the two methods. 2. The paper introduces a pre-inference step before each training batch to identify positions inducing the Low-Norm Effect. Such a step could introduce computational overhead, especially with larger datasets or
(1) A new soft-prompt vector normalization method for VLMs, which can be incorporated into any soft-prompt-based method; (2) better results under domain generalization settings for VLMs.
1. I would like more details on how the length of the soft-prompt vectors was chosen, e.g., why 4 and 16; will more ranges be investigated based on the specific tasks for VLMs? 2. I would like to see further investigation of combining Nemesis with existing PEFT algorithms to check whether the results can be improved further, so that other researchers can better integrate the method into their existing frameworks.
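One review above asks what happens when $\beta$ takes a decimal value rather than 0 or 1. The sketch below shows one way such a weighting could be wired up. It rests on loud assumptions: the two penalty functions are simple stand-ins (mean L2 norm over all positions, and over a flagged subset), not the paper's loss definitions; the weight `omega`, the function names, and the mapping of $\beta$ = 1 to the all-positions penalty are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a single prompt vector."""
    return math.sqrt(sum(x * x for x in vec))


def uniform_norm_loss(prompt):
    """Stand-in penalty: mean L2 norm over every prompt position."""
    return sum(l2_norm(v) for v in prompt) / len(prompt)


def selected_norm_loss(prompt, positions):
    """Stand-in penalty: mean L2 norm over the flagged positions only."""
    if not positions:
        return 0.0
    return sum(l2_norm(prompt[i]) for i in positions) / len(positions)


def total_loss(task_loss, prompt, positions, beta, omega=1.0):
    """Task loss plus a beta-weighted mix of the two stand-in penalties.

    beta=1.0 uses only the all-positions penalty, beta=0.0 only the
    flagged-positions penalty (assumed mapping); `omega` scales the mix.
    """
    norm_term = (beta * uniform_norm_loss(prompt)
                 + (1.0 - beta) * selected_norm_loss(prompt, positions))
    return task_loss + omega * norm_term


# Toy prompt with 3 positions; position 2 is assumed to have been flagged
# by some earlier selection step (not implemented here).
prompt = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
flagged = [2]

loss_uniform = total_loss(2.0, prompt, flagged, beta=1.0)
loss_selected = total_loss(2.0, prompt, flagged, beta=0.0)
loss_mixed = total_loss(2.0, prompt, flagged, beta=0.5)
```

With `beta=0.5` the result is the average of the two extreme cases. Whether such an interpolation helps in practice is exactly the open question the reviewer raises.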
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
