Enhance Vision-Language Alignment with Noise

Sida Huang; Hongyuan Zhang; Xuelong Li

arXiv:2412.10817 · cs.CV · December 18, 2024 · 2 cites



TL;DR

This paper introduces a novel noise-based fine-tuning method called PiNI for vision-language models like CLIP, improving their alignment in few-shot classification tasks by injecting beneficial noise into encoders.

Contribution

It proposes a new scheme to learn beneficial noise distributions for fine-tuning frozen VL models without extra modules, using variational inference to generate positive-incentive noise.
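The idea of learning a noise distribution for a frozen model can be illustrated with a minimal sketch. This is not the authors' PiNI implementation: it assumes a diagonal Gaussian noise distribution sampled with the reparameterization trick, adds the noise to the output image embeddings rather than inside the encoders, and uses random vectors as stand-ins for frozen CLIP features. The names `sample_noise`, `noisy_clip_logits`, `mu`, `log_sigma`, and the `temperature` value are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as in CLIP-style cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_noise(mu, log_sigma, rng):
    """Reparameterization trick: noise = mu + sigma * eps, eps ~ N(0, I).

    mu and log_sigma would be the only trainable parameters; the
    embeddings themselves stay frozen. (Assumption for this sketch.)
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps


def noisy_clip_logits(image_feats, text_feats, mu, log_sigma, rng,
                      temperature=0.07):
    """Add sampled noise to frozen image embeddings, then score classes
    by temperature-scaled cosine similarity against text embeddings."""
    noise = sample_noise(mu, log_sigma, rng)   # shape (dim,), broadcast over the batch
    img = l2_normalize(image_feats + noise)
    txt = l2_normalize(text_feats)
    return img @ txt.T / temperature


rng = np.random.default_rng(0)
dim, n_images, n_classes = 8, 4, 3

# Random stand-ins for frozen encoder outputs (not real CLIP features).
image_feats = rng.standard_normal((n_images, dim))
text_feats = rng.standard_normal((n_classes, dim))

# Hypothetical noise parameters: zero mean, small standard deviation.
mu = np.zeros(dim)
log_sigma = np.full(dim, -2.0)

logits = noisy_clip_logits(image_feats, text_feats, mu, log_sigma, rng)
probs = softmax(logits)   # per-image class probabilities
```

In a training loop, only `mu` and `log_sigma` would receive gradients from the few-shot classification loss, which is what lets the frozen model be adapted without extra modules in this simplified picture.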

Findings

01. PiNI improves alignment across 11 datasets.

02. Beneficial noise enhances the diversity of embeddings.

03. The method outperforms traditional fine-tuning approaches.

Abstract

With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Unlike existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or $\pi$-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme to learn a beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate $\pi$-noise for the visual and linguistic modalities. Then, we propose…


Code & Models

Repositories

hyzhang98/pini (PyTorch, official)

Videos

Enhance Vision-Language Alignment with Noise · underline

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · ALIGN