AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Yuhan Zhu; Yuyang Ji; Zhiyu Zhao; Gangshan Wu; Limin Wang

arXiv:2407.04603 · cs.CV · October 8, 2024 · 2 cites

TL;DR

The paper introduces AWT, a novel framework that enhances vision-language models' adaptability to new concepts through augmentation, weighting, and optimal transport, improving zero-shot and few-shot learning performance.

Contribution

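AWT is an adaptation framework that combines input augmentation (image transformations and language-model-generated class descriptions), entropy-based input weighting, and optimal transport. It improves the zero-shot performance of VLMs without additional training, and an integrated multimodal adapter module extends it to few-shot learning.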

Findings

01. AWT outperforms state-of-the-art methods in zero-shot image classification.

02. AWT enhances few-shot learning with an integrated multimodal adapter.

03. AWT demonstrates robustness across different VLM architectures and scales.

Abstract

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We…
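The weighting and transport steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the code from the official repository. The function names (`entropy_weights`, `sinkhorn_plan`, `ot_distance`), the softmax over negative entropies, the cosine cost, the Sinkhorn solver, and all parameter values are assumptions chosen for illustration. Random unit vectors stand in for the VLM embeddings of augmented image views and class descriptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_weights(logits, temperature=1.0):
    """Map per-input class logits (n_inputs, n_classes) to importance weights.

    Inputs whose predictions have low entropy (confident) get larger weights.
    The softmax over negative entropies is an assumed, illustrative choice.
    """
    probs = softmax(logits, axis=-1)
    h = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return softmax(-h / temperature)


def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations."""
    K = np.exp(-cost / reg)
    u = np.ones(len(a))
    for _ in range(n_iters):
        v = b / (K.T @ u)  # rescale to match column marginals b
        u = a / (K @ v)    # rescale to match row marginals a
    return u[:, None] * K * v[None, :]


def ot_distance(img_feats, txt_feats, a, b, reg=0.1):
    """Transport cost between weighted image views and class descriptions.

    Assumes both feature sets are L2-normalized, so cost = 1 - cosine.
    """
    cost = 1.0 - img_feats @ txt_feats.T
    plan = sinkhorn_plan(cost, a, b, reg)
    return float((plan * cost).sum())


def unit(x):
    """L2-normalize rows."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Hypothetical stand-ins for VLM embeddings (random, for illustration only).
rng = np.random.default_rng(0)
views = unit(rng.normal(size=(4, 8)))   # 4 augmented views of one image
descs = unit(rng.normal(size=(3, 8)))   # 3 descriptions of one class
protos = unit(rng.normal(size=(5, 8)))  # 5 class-name embeddings

a = entropy_weights(100.0 * views @ protos.T)  # weights for image views
b = np.full(3, 1.0 / 3.0)                      # uniform description weights
score = -ot_distance(views, descs, a, b)       # higher means a closer match
```

In this sketch, a class score is the negative transport cost between the weighted set of image views and the weighted set of descriptions for that class; repeating the computation per class and taking the largest score would give a prediction. The description weights are left uniform here for brevity, whereas the abstract states that inputs are weighted by prediction entropy.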

Code & Models

Repositories

MCG-NJU/AWT (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training · Adapter