The Solution for Language-Enhanced Image New Category Discovery
Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang

TL;DR
This paper reverses the CLIP training process to learn class-specific Pseudo Visual Prompts, then uses contrastive learning and a dual-adapter module to enhance the visual representations of textual labels for zero-shot multi-label image recognition.
Contribution
The paper proposes Pseudo Visual Prompts, class-specific prompts pre-trained on sentence data generated by large language models by reversing the CLIP training process, together with a dual-adapter module for better knowledge integration, advancing zero-shot recognition.
Findings
Outperforms state-of-the-art methods on clean, annotated text data.
Achieves superior results on pseudo text data generated by language models.
Enhances visual representation capacity of textual labels.
Abstract
Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a…
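The prompt pre-training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature value, and the averaging over positive classes are assumptions made for illustration, and the sentence embedding stands in for the output of a CLIP text encoder. The idea shown is only that a sentence mentioning a class should score higher against that class's prompt vector than against the prompts of other classes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_contrastive_loss(sentence_emb, class_prompts, positives, temperature=0.07):
    """Softmax cross-entropy of one sentence embedding against all class prompts.

    sentence_emb  -- stand-in for a text-encoder embedding of a generated sentence
    class_prompts -- one vector per category (hypothetical pseudo visual prompts)
    positives     -- indices of the categories the sentence mentions
    temperature   -- assumed value, not taken from the paper
    """
    logits = [cosine(sentence_emb, p) / temperature for p in class_prompts]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Average the per-class negative log-likelihood over the positive classes
    # (a simplification for the multi-label case).
    return sum(log_z - logits[k] for k in positives) / len(positives)


# Toy example: two categories with orthogonal prompts.
prompts = [[1.0, 0.0], [0.0, 1.0]]
sentence = [1.0, 0.0]  # embedding aligned with category 0
loss_match = prompt_contrastive_loss(sentence, prompts, positives=[0])
loss_mismatch = prompt_contrastive_loss(sentence, prompts, positives=[1])
```

In a real training loop the prompt vectors would be learnable parameters updated by gradient descent on this kind of loss; whether the CLIP encoders stay frozen during that step is not stated in the text above and is assumed here.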
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
