ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images
Prithviraj Purushottam Naik, Rohit Agarwal

TL;DR
ENCLIP enhances CLIP's performance for fashion multimodal search by ensembling models and clustering images, effectively addressing limited data and low-quality images to improve search accuracy.
Contribution
This paper introduces ENCLIP, a novel ensembling and clustering-based method to improve CLIP's effectiveness in fashion search with scarce and low-quality data.
Findings
Improved search accuracy in fashion multimodal tasks.
Effective handling of limited data and low-quality images.
Demonstrated superiority over baseline models.
Abstract
Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Based on their preferences, style, or specific attributes, users can search for products by combining text and image information. Text-to-image searches enable users to find visually similar items or describe products using natural language. This paper presents an innovative approach called ENCLIP, for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model, specifically in Multimodal Search targeted towards the domain of fashion intelligence. This method focuses on addressing the challenges posed by limited data availability and low-quality images. This paper proposes an algorithm that involves training and ensembling multiple instances of the CLIP model, and leveraging clustering techniques to group similar…
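The abstract names two ingredients: ensembling several CLIP instances and clustering similar images. The excerpt above is truncated and does not say how either step is carried out, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each model has already produced text and image embeddings as NumPy arrays, that ensembling means averaging per-model cosine similarities, and that clustering is plain k-means over image embeddings. The function names (`ensemble_similarity`, `kmeans`) and all parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def ensemble_similarity(text_embs, image_embs):
    """Average text-to-image cosine similarity over several models.

    text_embs:  list of arrays of shape (d_m,), one query embedding per model.
    image_embs: list of arrays of shape (n, d_m), the same n images per model.
    Returns an array of shape (n,) with the mean similarity per image.
    Averaging is an assumption; the source does not specify the ensembling rule.
    """
    sims = [
        l2_normalize(imgs) @ l2_normalize(txt)
        for txt, imgs in zip(text_embs, image_embs)
    ]
    return np.mean(sims, axis=0)


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm), used here as an assumed
    stand-in for the clustering step mentioned in the abstract."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy).
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's centroid to the mean of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical models with different embedding sizes, five images.
    image_embs = [rng.normal(size=(5, 8)), rng.normal(size=(5, 16))]
    text_embs = [rng.normal(size=8), rng.normal(size=16)]
    scores = ensemble_similarity(text_embs, image_embs)
    print("ranking (best first):", np.argsort(-scores))
    labels, _ = kmeans(image_embs[0], k=2)
    print("cluster labels:", labels)
```

Averaging similarities rather than raw embeddings lets the models use different embedding dimensions, which is why the sketch takes that route. How ENCLIP actually combines the cluster assignments with the ensemble scores is not stated in the text available here.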
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
