Infusing fine-grained visual knowledge to Vision-Language Models
Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum

TL;DR
This paper introduces a fine-tuning method for Vision-Language Models that improves fine-grained visual retrieval while preserving the models' general multimodal knowledge, without requiring any text data during fine-tuning.
Contribution
The authors propose a novel fine-tuning approach inspired by continual learning that balances domain adaptation with knowledge retention, improving retrieval performance across various datasets.
Findings
Consistently improves fine-grained retrieval accuracy.
Retains multimodal knowledge without using text data during fine-tuning.
Effective across multiple benchmarks and pretrained models.
Abstract
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed…
Taxonomy
Topics: Multimodal Machine Learning Applications
