Attention to the Burstiness in Visual Prompt Tuning!
Yuzhu Wang, Manni Duan, Shu Kong

TL;DR
This paper introduces Bilinear Prompt Tuning (BPT), which combines data whitening with a low-rank bilinear model to improve prompt tuning of vision Transformers, raising accuracy while reducing parameter count and computational overhead.
Contribution
The paper proposes a whitening-based approach and a low-rank bilinear model for prompt tuning. Together they address the burstiness and non-Gaussian distributions observed in VPT, leading to faster and more accurate training.
Findings
BPT significantly improves accuracy, e.g., +25 points on the CUB dataset.
BPT outperforms existing VPT methods across multiple benchmarks.
BPT reduces parameter count and computational overhead.
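As a rough illustration of why a low-rank factorization can reduce the parameter count, the sketch below builds a prompt matrix as the product of two smaller matrices and compares the number of learnable values. This is only one plausible reading of "low-rank bilinear model". The sizes (`num_prompts`, `dim`, `rank`) and the names `A` and `B` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical sizes chosen for illustration only (not from the paper).
num_prompts, dim, rank = 100, 768, 8

rng = np.random.default_rng(0)

# Two small factors stand in for the learnable parameters.
A = rng.normal(scale=0.02, size=(num_prompts, rank))
B = rng.normal(scale=0.02, size=(rank, dim))

# The prompt matrix is their product, so its rank is at most `rank`.
prompts = A @ B

full_params = num_prompts * dim      # learning the prompts directly
low_rank_params = A.size + B.size    # learning the two factors instead

print(full_params, low_rank_params)  # 76800 6944
```

With these assumed sizes the factorized form stores roughly a tenth of the values, at the cost of restricting the prompts to a rank-8 subspace.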
Abstract
Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings, and the key and query projectors within the Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be…
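The whitening step described in the abstract (de-correlating the data and equalizing its variance) can be sketched with generic ZCA whitening. This is a minimal sketch under stated assumptions, not the authors' derivation: the data is synthetic Laplacian noise standing in for patch embeddings, the function name `zca_whitening_matrix` is made up for this example, and right-multiplying a prompt by the whitening matrix is an assumed convention.

```python
import numpy as np

def zca_whitening_matrix(x, eps=1e-6):
    """Return (mean, W) so that (x - mean) @ W has near-identity covariance."""
    mean = x.mean(axis=0, keepdims=True)
    xc = x - mean
    cov = xc.T @ xc / (x.shape[0] - 1)
    # The covariance is symmetric, so eigh gives real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA: rotate, rescale each direction to unit variance, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-in for patch embeddings: heavy-tailed and correlated.
mixing = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
x = rng.laplace(size=(5000, dim)) @ mixing

mean, W = zca_whitening_matrix(x)
x_white = (x - mean) @ W
cov_white = np.cov(x_white, rowvar=False)

# Assumed convention: a (hypothetical) prompt is multiplied by W.
prompt = rng.normal(size=(4, dim))
prompt_whitened = prompt @ W
```

After whitening, `cov_white` is close to the identity matrix, meaning the features are de-correlated with equal variance. Whitening only fixes second-order statistics, so it does not by itself make the data Gaussian.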
Taxonomy
Topics: Cognitive Science and Education Research · Design Education and Practice
Methods: Dropout · Absolute Position Encodings · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Sparse Evolutionary Training · Dense Connections · Layer Normalization · Vision Transformer
