Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Aristeidis Panos; Rahaf Aljundi; Daniel Olmeda Reino; Richard E. Turner

arXiv:2407.16526·cs.CV·July 24, 2024

TL;DR

This paper introduces an efficient method for selectively and locally updating the vision encoder in vision-language models, improving accuracy on data where the encoder previously made errors while maintaining overall robustness.

Contribution

It proposes a novel selective and local update technique for vision encoders that enhances VLM performance and robustness during continual few-shot learning.

Findings

01. Significant performance gains on error-prone data.

02. Maintains robustness while improving accuracy.

03. Effective in continual few-shot update scenarios.

Abstract

Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and…
