Efficient Few-Shot Continual Learning in Vision-Language Models

Aristeidis Panos; Rahaf Aljundi; Daniel Olmeda Reino; Richard E. Turner

arXiv:2502.04098·cs.CV·February 10, 2025

Open Access

TL;DR

This paper introduces LoRSU, a method for efficient, structured updates to image encoders in vision-language models, enabling effective few-shot continual learning with significantly reduced computational costs.

Contribution

LoRSU is a novel approach that selectively updates critical parameters in image encoders, improving efficiency and performance in continual learning scenarios.

Findings

01. Reduces computational overhead by over 25x compared to full model updates.

02. Maintains high performance in few-shot continual learning tasks.

03. Demonstrates scalability and robustness across VQA benchmarks.

Abstract

Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource…
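
As a rough illustration of what "update only the most critical parameters" can mean in practice, the sketch below masks a gradient step so that only the entries with the largest gradient magnitude change. This is not the LoRSU algorithm: the selection rule (top-k by absolute gradient), the function names `top_k_mask` and `masked_sgd_step`, and the toy numbers are all assumptions made for illustration, and the abstract does not specify how the paper selects parameters.

```python
# Illustrative sketch only: a gradient-magnitude mask for sparse updates.
# NOT the LoRSU algorithm; the selection rule and all names are assumptions.

def top_k_mask(grads, k):
    """Return a 0/1 mask marking the k entries with the largest |gradient|."""
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(grads))]


def masked_sgd_step(params, grads, mask, lr=0.1):
    """Apply a gradient step only where mask == 1; other parameters stay fixed."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


# Toy values (hypothetical): four parameters and their gradients.
params = [0.5, -1.0, 2.0, 0.0]
grads = [0.01, -0.9, 0.05, 0.4]

mask = top_k_mask(grads, k=2)          # selects the two largest-|gradient| entries
updated = masked_sgd_step(params, grads, mask)
```

Parameters outside the mask are left exactly as they were, which is one simple way a localized update can avoid disturbing behavior the model already gets right.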

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Text and Document Classification Technologies

Methods: Contrastive Language-Image Pre-training