CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation
Muhammad Ali, Salman Khan

TL;DR
The paper introduces CLIP-Decoder, a multimodal approach leveraging CLIP for zero-shot multilabel classification, aligning image and text embeddings to improve performance on unseen categories.
Contribution
It presents a novel multimodal learning method that aligns image and text embeddings for enhanced zero-shot multilabel classification, outperforming existing approaches.
Findings
Achieves a 3.9% performance increase over existing methods.
Shows a 2.3% improvement in generalized zero-shot learning.
Outperforms state-of-the-art on zero-shot multilabel tasks.
Abstract
Multi-label classification is an essential task used in a wide variety of real-world applications. Multi-label zero-shot learning classifies images into multiple unseen categories for which no training data is available, while in the generalized zero-shot setting the test set may also include seen classes. CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head. We introduce multi-modal representation learning in CLIP-Decoder, using the text encoder to extract text features and the image encoder to extract image features. Furthermore, we minimize semantic mismatch by aligning image and word embeddings in the same dimension and comparing their respective representations using a combined loss, which comprises a classification loss and a CLIP loss. This strategy outperforms other methods and we achieve cutting-edge results on…
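The combined loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary cross-entropy as the classification loss, a symmetric CLIP-style contrastive loss over one matched image/text embedding pair per sample, and a hypothetical weighting parameter `clip_weight`. The function names and the temperature value are likewise assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Assumption: image_emb[i] and text_emb[i] form a matched pair and
    # both already live in the same embedding dimension.
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # Cosine-similarity logits between every image and every text.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    # Image-to-text direction: the correct text for image k is index k.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def bce_with_logits(logits, targets):
    # Assumption: plain binary cross-entropy stands in for the
    # classification loss; the source does not name the exact loss.
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            # Stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def combined_loss(label_logits, label_targets, image_emb, text_emb,
                  clip_weight=1.0):
    # Classification term plus a weighted alignment term.
    # clip_weight is a hypothetical hyperparameter, not from the source.
    return (bce_with_logits(label_logits, label_targets)
            + clip_weight * clip_loss(image_emb, text_emb))
```

In this sketch the alignment term shrinks as matched image and text embeddings point in the same direction, while the classification term is computed independently on the per-label logits. How the paper pairs images with label text and how it weights the two terms is not stated in the excerpt above.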
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
