Conceptual Codebook Learning for Vision-Language Models
Yi Zhang, Ke Yu, Siqi Wu, Zhihai He

TL;DR
This paper introduces CoCoLe, a novel fine-tuning approach for vision-language models that uses a conceptual codebook of visual concepts to improve generalization in few-shot learning scenarios.
Contribution
We propose Conceptual Codebook Learning (CoCoLe), a new method that enhances VLMs' generalization by leveraging a visual concept codebook and a concept cache during fine-tuning.
Findings
Outperforms state-of-the-art methods in various evaluation settings.
Improves alignment between visual and linguistic modalities.
Effective in low-shot and cross-dataset scenarios.
Abstract
In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors, are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a…
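The key-value lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `select_prompts`), the use of cosine similarity, the top-k selection rule, and the toy 3-dimensional vectors are all assumptions made for illustration. In the paper the values are learned conceptual prompts; here they are placeholder strings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompts(image_feature, keys, prompts, top_k=2):
    """Return (index, prompt) pairs for the top_k keys most similar to the image feature.

    keys    : list of visual-concept vectors (the codebook keys; hypothetical values here)
    prompts : list of conceptual prompts, one per key (the codebook values)
    """
    scores = [cosine(image_feature, key) for key in keys]
    # Rank codebook entries from most to least similar to the image feature.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return [(i, prompts[i]) for i in ranked[:top_k]]


# Toy codebook: four concept keys with placeholder prompt values (assumed, not from the paper).
keys = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
prompts = ["prompt_texture", "prompt_shape", "prompt_color", "prompt_mixed"]

# A stand-in for the image encoder's output for one image.
image_feature = [0.9, 0.1, 0.0]

selected = select_prompts(image_feature, keys, prompts, top_k=2)
print(selected)
```

In a full pipeline the selected prompts would be combined with the class embeddings and passed to the text encoder; that step is omitted here because the truncated abstract does not specify how it is done.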
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications
