Conceptual Codebook Learning for Vision-Language Models

Yi Zhang; Ke Yu; Siqi Wu; Zhihai He

arXiv:2407.02350 · cs.CV · July 16, 2024



TL;DR

This paper introduces CoCoLe, a novel fine-tuning approach for vision-language models that uses a conceptual codebook of visual concepts to improve generalization in few-shot learning scenarios.

Contribution

We propose Conceptual Codebook Learning (CoCoLe), a new method that enhances VLMs' generalization by leveraging a visual concept codebook and a concept cache during fine-tuning.

Findings

01

Outperforms state-of-the-art methods in various evaluation settings.

02

Improves alignment between visual and linguistic modalities.

03

Effective in low-shot and cross-dataset scenarios.

Abstract

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors, are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a…
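The key-value lookup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity scoring, the `top_k` parameter, the function names, and the three-dimensional features are all assumptions made for illustration, and the string values stand in for what would be learnable prompt vectors fed to the text encoder. The sketch covers only the retrieval step, not the classification that follows it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_prompts(image_feature, codebook, top_k=2):
    """Return the values (prompts) whose keys best match the image feature.

    `codebook` is a list of (key_vector, prompt) pairs. Both the scoring
    rule and `top_k` are hypothetical choices, not taken from the paper.
    """
    scored = [(cosine(image_feature, key), prompt) for key, prompt in codebook]
    # Sort by similarity only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in scored[:top_k]]


# Toy codebook: visual-concept keys paired with placeholder prompt values.
codebook = [
    ([1.0, 0.0, 0.0], "prompt_striped"),
    ([0.0, 1.0, 0.0], "prompt_round"),
    ([0.0, 0.0, 1.0], "prompt_red"),
]

# Hypothetical image-encoder output for a single image.
image_feature = [0.9, 0.1, 0.4]
selected = retrieve_prompts(image_feature, codebook, top_k=2)
print(selected)
```

In this toy setup the image feature lies closest to the first and third keys, so those two prompts are the ones that would be passed on to the text encoder alongside the class embeddings.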


Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications