Extending CLIP for Category-to-image Retrieval in E-commerce
Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne van Noord, Ernst Kuiper, and Maarten de Rijke

TL;DR
This paper introduces CLIP-ITA, a multimodal model for category-to-image retrieval in e-commerce, effectively combining textual, visual, and attribute data to improve search accuracy.
Contribution
The paper proposes a novel multimodal model, CLIP-ITA, specifically designed for category-to-image retrieval in e-commerce, leveraging multiple data modalities for enhanced performance.
Findings
CLIP-ITA outperforms visual-only models in retrieval tasks.
Adding attribute information improves model accuracy.
Multimodal integration enhances e-commerce search effectiveness.
Abstract
E-commerce provides rich multimodal data that is barely leveraged in practice. One aspect of this data is a category tree that is being used in search and recommendation. However, in practice, during a user's session there is often a mismatch between a textual and a visual representation of a given category. Motivated by the problem, we introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA. The model leverages information from multiple modalities (textual, visual, and attribute modality) to create product representations. We explore how adding information from multiple modalities (textual, visual, and attribute modality) impacts the model's performance. In particular, we observe that CLIP-ITA significantly outperforms a comparable model that leverages only the visual modality and a comparable model that leverages the visual and…
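As a rough illustration of the task described above, the sketch below ranks products against a category query by cosine similarity. It is not the CLIP-ITA implementation: the averaging-based fusion, the function names (`l2_normalize`, `fuse_product`, `rank_products`), and the random vectors standing in for encoder outputs are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def fuse_product(text_emb, image_emb, attr_emb):
    """Hypothetical fusion: average the unit-normalized modality embeddings.

    The paper's actual fusion mechanism is not described in this summary;
    simple averaging is only a placeholder.
    """
    stacked = np.stack(
        [l2_normalize(text_emb), l2_normalize(image_emb), l2_normalize(attr_emb)],
        axis=0,
    )
    return l2_normalize(stacked.mean(axis=0))

def rank_products(category_emb, product_embs, k=3):
    """Return indices and cosine scores of the top-k products for a category."""
    scores = l2_normalize(product_embs) @ l2_normalize(category_emb)
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Random stand-ins for encoder outputs (assumption: 5 products, 8 dimensions).
rng = np.random.default_rng(0)
n_products, dim = 5, 8
text = rng.normal(size=(n_products, dim))
image = rng.normal(size=(n_products, dim))
attrs = rng.normal(size=(n_products, dim))

products = fuse_product(text, image, attrs)

# Toy category query placed close to product 2 so the ranking is easy to check.
category = products[2] + 0.01 * rng.normal(size=dim)

top_idx, top_scores = rank_products(category, products, k=3)
```

In a real system the three inputs would come from trained text, image, and attribute encoders, and the category embedding from a text encoder applied to the category name; the ranking step itself would stay the same.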
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
