MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid
Fengjun Wang, Sarai Mizrachi, Moran Beladev, Guy Nadav, Gil Amsalem, Karen Lastmann Assaraf, Hadas Harush Boker

TL;DR
MuMIC leverages contrastively pretrained multimodal models with a tempered sigmoid loss for high-performance multi-label image classification, effectively handling noisy data and enabling zero-shot predictions.
Contribution
This paper introduces MuMIC, the first adaptation of contrastively learnt multimodal pretraining for real-world multi-label image classification tasks.
Findings
Achieved 85.6% GAP@10 on Booking.com images
Outperformed state-of-the-art models in multi-label classification
Supported zero-shot and domain-specific predictions
Abstract
Multi-label image classification is a foundational topic in various domains. Multimodal learning approaches have recently achieved outstanding results in image representation and single-label image classification. For instance, Contrastive Language-Image Pretraining (CLIP) demonstrates impressive image-text representation learning abilities and is robust to natural distribution shifts. This success inspires us to leverage multimodal learning for multi-label classification tasks and to benefit from contrastively learnt pretrained models. We propose the Multimodal Multi-label Image Classification (MuMIC) framework, which utilizes a hardness-aware tempered sigmoid based Binary Cross Entropy loss function, thus enabling optimization on multi-label objectives and transfer learning on CLIP. MuMIC is capable of providing high classification performance, handling real-world noisy data,…
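To make the loss described in the abstract more concrete, here is a minimal NumPy sketch of a binary cross-entropy computed on temperature-scaled cosine similarities between image and label-text embeddings. It is an illustration under stated assumptions, not the authors' implementation: this summary does not give the exact formulation, so the function name `tempered_sigmoid_bce`, the default temperature of 0.1, and the absence of a bias term are all hypothetical choices.

```python
import numpy as np


def tempered_sigmoid_bce(image_emb, text_emb, targets, temperature=0.1):
    """Mean BCE over every (image, label) pair, using tempered logits.

    image_emb: (n_images, d) image embeddings
    text_emb:  (n_labels, d) label-text embeddings
    targets:   (n_images, n_labels) array of 0/1 multi-label targets
    temperature: assumed value; smaller values sharpen the sigmoid
    """
    # L2-normalise so the dot product is a cosine similarity in [-1, 1].
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Dividing by the temperature stretches the similarities into logits.
    logits = image_emb @ text_emb.T / temperature

    # Numerically stable BCE with logits:
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    per_pair = (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_pair.mean())


# Toy example: two images, two labels, orthogonal unit embeddings.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])

matching = np.array([[1.0, 0.0], [0.0, 1.0]])    # labels agree with similarity
mismatched = np.array([[0.0, 1.0], [1.0, 0.0]])  # labels contradict similarity

loss_matching = tempered_sigmoid_bce(image_emb, text_emb, matching)
loss_mismatched = tempered_sigmoid_bce(image_emb, text_emb, mismatched)
```

Because each label gets its own sigmoid rather than sharing a softmax, an image can score high on several labels at once, which is what a multi-label objective requires. In this sketch a lower temperature makes confidently wrong pairs contribute a larger loss.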
Taxonomy
Topics: Text and Document Classification Technologies · Machine Learning in Bioinformatics · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
