Enhance Vision-Language Alignment with Noise
Sida Huang, Hongyuan Zhang, Xuelong Li

TL;DR
This paper introduces a novel noise-based fine-tuning method called PiNI for vision-language models like CLIP, improving their alignment in few-shot classification tasks by injecting beneficial noise into encoders.
Contribution
It proposes a new scheme to learn beneficial noise distributions for fine-tuning frozen VL models without extra modules, using variational inference to generate positive-incentive noise.
Findings
PiNI improves alignment across 11 datasets.
Beneficial noise enhances diversity of embeddings.
Method outperforms traditional fine-tuning approaches.
Abstract
With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Different from existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or π-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme to learn beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate π-noise towards visual and linguistic modalities. Then, we propose…
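To make the idea of fine-tuning a frozen model through learned noise more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes, for illustration only, that the noise is Gaussian, that it is sampled with the reparameterization trick, and that it is added directly to the output embeddings of the frozen encoders before CLIP-style cosine-similarity classification. All names (`sample_noise`, `noisy_clip_logits`, `img_mu`, `txt_log_sigma`, and so on) are hypothetical, the temperature is an assumed value, and random arrays stand in for real CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sample_noise(mu, log_sigma, rng):
    """Reparameterization trick: noise = mu + sigma * eps, with eps ~ N(0, I).

    mu and log_sigma are the parameters that would be learned; the frozen
    encoders themselves are never updated. (Assumed Gaussian form.)
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def noisy_clip_logits(image_feats, text_feats, img_noise, txt_noise, temperature=0.01):
    """CLIP-style logits after adding noise to both modalities' embeddings."""
    v = l2_normalize(image_feats + img_noise)
    t = l2_normalize(text_feats + txt_noise)
    return (v @ t.T) / temperature


rng = np.random.default_rng(0)
n_images, n_classes, dim = 4, 3, 8

# Stand-ins for the outputs of the frozen image and text encoders.
image_feats = rng.standard_normal((n_images, dim))
text_feats = rng.standard_normal((n_classes, dim))

# Hypothetical learnable noise parameters, initialized to small-variance noise.
img_mu = np.zeros((n_images, dim))
img_log_sigma = np.full((n_images, dim), -3.0)
txt_mu = np.zeros((n_classes, dim))
txt_log_sigma = np.full((n_classes, dim), -3.0)

img_noise = sample_noise(img_mu, img_log_sigma, rng)
txt_noise = sample_noise(txt_mu, txt_log_sigma, rng)

# Class probabilities for each image under the noise-perturbed embeddings.
probs = softmax(noisy_clip_logits(image_feats, text_feats, img_noise, txt_noise))
```

In a training loop, only the noise parameters would receive gradients from the few-shot classification loss, which is what lets the encoders stay frozen. Where the noise enters the encoders and how its distribution is parameterized in the paper are not specified in the abstract above, so those choices here are placeholders.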
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training · ALIGN
