Localized Latent Updates for Fine-Tuning Vision-Language Models

Moritz Ibing; Isaak Lim; Leif Kobbelt

arXiv:2212.06556·cs.CV·December 14, 2022

TL;DR

This paper introduces a lightweight adapter for vision-language models that updates predictions only near seen data points. This enables fast fine-tuning while preserving performance on unseen data, and is shown to be effective in few-shot learning.

Contribution

It proposes a novel localized update method that improves fine-tuning efficiency and preserves generalization in vision-language models.

Findings

1. Effective in few-shot learning scenarios
2. Comparable or better performance on seen and unseen classes
3. Fast and lightweight adaptation process

Abstract

Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific datasets. When doing so, it is desirable that updating the model is fast and that the model does not lose its capabilities on data outside of the dataset, as is often the case with classical fine-tuning approaches. In this work, we suggest a lightweight adapter that only updates the model's predictions close to seen data points. We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results on classes both seen and unseen during training are comparable with or improve on the state of the art.
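
The idea of updating predictions only close to seen data points can be illustrated with a small sketch. This is not the authors' adapter; it is a generic kernel-weighted correction written under stated assumptions: image and text embeddings are plain NumPy arrays, the function name `localized_logits` and the parameters `beta` and `alpha` are hypothetical, and the exponential kernel on cosine distance is an illustrative choice. It only shows how a correction can vanish far from the few-shot examples, so that the frozen model's zero-shot predictions are left untouched there.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def localized_logits(query, text_emb, train_emb, train_labels,
                     num_classes, beta=20.0, alpha=1.0):
    """Zero-shot logits plus a correction that decays away from seen points.

    query:        (n, d) image embeddings to classify
    text_emb:     (C, d) class text embeddings (zero-shot classifier)
    train_emb:    (m, d) embeddings of the few-shot training images
    train_labels: (m,)   integer class labels of the training images
    beta, alpha:  hypothetical sharpness and strength of the local update
    """
    q = normalize(query)
    zero_shot = q @ normalize(text_emb).T        # (n, C) frozen predictions
    sim = q @ normalize(train_emb).T             # (n, m) cosine similarity
    weights = np.exp(-beta * (1.0 - sim))        # 1 at a seen point, ~0 far away
    one_hot = np.eye(num_classes)[train_labels]  # (m, C) label indicators
    return zero_shot + alpha * weights @ one_hot

# Toy usage: one training image of class 1, two text embeddings.
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
train_emb = np.array([[0.0, 0.0, 1.0]])
train_labels = np.array([1])

near = np.array([[0.0, 0.0, 2.0]])  # same direction as the training image
far = np.array([[1.0, 0.0, 0.0]])   # orthogonal to the training image

out_near = localized_logits(near, text_emb, train_emb, train_labels, 2)
out_far = localized_logits(far, text_emb, train_emb, train_labels, 2)
```

In this toy setup the query that coincides with the training image receives the full correction toward class 1, while the orthogonal query keeps essentially its zero-shot logits, which is the qualitative behavior the abstract describes for data outside the fine-tuning set.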

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training