Localized Latent Updates for Fine-Tuning Vision-Language Models
Moritz Ibing, Isaak Lim, Leif Kobbelt

TL;DR
This paper introduces a lightweight adapter for vision-language models that updates predictions only in the neighborhood of seen data points. This makes fine-tuning fast while preserving performance on unseen data, and the approach is demonstrated in few-shot learning.
Contribution
It proposes a novel localized update method that improves fine-tuning efficiency and preserves generalization in vision-language models.
Findings
Effective in few-shot learning scenarios
Performance comparable to or better than the state of the art on classes both seen and unseen during training
Fast and lightweight adaptation process
Abstract
Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific datasets. When doing so, it is desirable that updating the model is fast and that the model does not lose its capabilities on data outside of the dataset, as is often the case with classical fine-tuning approaches. In this work we suggest a lightweight adapter that only updates the model's predictions close to seen datapoints. We demonstrate the effectiveness and speed of this relatively simple approach in the context of few-shot learning, where our results on classes both seen and unseen during training are comparable with or improve on the state of the art.
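To make the idea of updating predictions only close to seen datapoints concrete, here is a minimal sketch. It is not the authors' formulation, which the abstract does not spell out. It assumes a kernel-weighted correction added to zero-shot cosine logits, in the spirit of cache-based adapters. The function names and the parameters `alpha` (correction strength) and `beta` (locality sharpness) are hypothetical, and the embeddings are toy 2-D vectors rather than real CLIP features.

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # Cosine similarity between two vectors.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_logits(image_emb, text_embs):
    # One logit per class: similarity of the image to each class text embedding.
    return [cosine(image_emb, t) for t in text_embs]

def localized_logits(image_emb, text_embs, support_embs, support_labels,
                     alpha=1.0, beta=20.0):
    # ASSUMPTION: illustrative update rule, not taken from the paper.
    # Each stored few-shot example adds to the logit of its own class,
    # weighted by a kernel that decays quickly with distance from the query.
    logits = zero_shot_logits(image_emb, text_embs)
    for s, y in zip(support_embs, support_labels):
        weight = math.exp(-beta * (1.0 - cosine(image_emb, s)))
        logits[y] += alpha * weight
    return logits
```

In this sketch, a query that coincides with a stored example receives the full correction `alpha`, while a query far from every stored example keeps essentially its zero-shot logits. That is the property the abstract describes as not losing capabilities on data outside the fine-tuning dataset.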
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training
