CLIP-Decoder: ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation

Muhammad Ali; Salman Khan

arXiv:2406.14830 · cs.CV · June 24, 2024

TL;DR

The paper introduces CLIP-Decoder, a multimodal approach leveraging CLIP for zero-shot multilabel classification, aligning image and text embeddings to improve performance on unseen categories.

Contribution

It presents a novel multimodal learning method that aligns image and text embeddings for enhanced zero-shot multilabel classification, outperforming existing approaches.

Findings

01. Achieves a 3.9% performance increase over existing methods on zero-shot multilabel classification.

02. Shows a 2.3% improvement in generalized zero-shot learning.

03. Outperforms the state of the art on zero-shot multilabel tasks.

Abstract

Multi-label classification is an essential task with a wide variety of real-world applications. Multi-label zero-shot learning classifies images into multiple unseen categories for which no training data is available, whereas in the generalized zero-shot setting the test set may also include seen classes. CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head. We introduce multi-modal representation learning in CLIP-Decoder, using the text encoder to extract text features and the image encoder to extract image features. Furthermore, we minimize semantic mismatch by aligning image and word embeddings in the same dimension and comparing their representations using a combined loss, which comprises a classification loss and a CLIP loss. This strategy outperforms other methods, and we achieve cutting-edge results on…
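
The combined loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the classification term is a plain binary cross-entropy over per-class logits (the paper may use a different multi-label loss), that the CLIP term is the usual symmetric contrastive loss over matched image and text embeddings already projected to the same dimension, and that the two terms are summed with a hypothetical weight `lam`. The function names, the temperature value, and the random inputs are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric contrastive loss: row i of img_emb is assumed to match row i of txt_emb.
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def bce_with_logits(logits, targets):
    # Stable binary cross-entropy; a stand-in for the paper's classification loss.
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def combined_loss(class_logits, targets, img_emb, txt_emb, lam=1.0):
    # lam is a hypothetical weighting; the abstract does not state how the terms are balanced.
    return bce_with_logits(class_logits, targets) + lam * clip_loss(img_emb, txt_emb)

# Toy usage with random data: 4 images, 8-dimensional shared space, 5 labels.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = img_emb + 0.01 * rng.normal(size=(4, 8))   # nearly aligned text embeddings
class_logits = rng.normal(size=(4, 5))
targets = (rng.random((4, 5)) > 0.5).astype(float)   # multi-hot label matrix
total = combined_loss(class_logits, targets, img_emb, txt_emb)
print(f"combined loss: {total:.4f}")
```

In this sketch the contrastive term is small when each image embedding sits closest to its own text embedding and grows when the pairs are misaligned, which is the sense in which such a term penalizes semantic mismatch between the two modalities.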

Code & Models

Repositories

YCAyca/CLIP_ML_Decoder
pytorch

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training