Infusing fine-grained visual knowledge to Vision-Language Models

Nikolaos-Antonios Ypsilantis; Kaifeng Chen; André Araujo; Ondřej Chum

arXiv:2508.12137 · cs.CV · August 19, 2025

TL;DR

This paper introduces a fine-tuning method for Vision-Language Models that improves fine-grained visual retrieval while preserving the models' general multimodal knowledge, without using any text data during fine-tuning.

Contribution

The authors propose a novel fine-tuning approach inspired by continual learning that balances domain adaptation with knowledge retention, improving retrieval performance across various datasets.

Findings

01. Consistently improves fine-grained retrieval accuracy.

02. Retains multimodal knowledge without using text data during fine-tuning.

03. Effective across multiple benchmarks and pretrained models.

Abstract

Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed…

Taxonomy

Topics: Multimodal Machine Learning Applications