e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
Wonyoung Shin, Jonghun Park, Taekang Woo, Yongwoo Cho, Kwangjin Oh, Hwanjun Song

TL;DR
This paper introduces e-CLIP, a contrastive learning framework that aligns vision and language models using unlabeled product data to improve various e-commerce search and recommendation tasks.
Contribution
It presents a large-scale vision-language representation learning approach tailored for e-commerce, addressing domain-specific challenges and demonstrating superior downstream task performance.
Findings
Outperforms the baseline in each downstream task studied, in both single-modality and multimodal settings.
Effective alignment of visual and textual product representations.
Improves accuracy in product classification and matching.
Abstract
Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. As a backbone for online shopping platforms and inspired by the recent success in representation learning research, we propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images. We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges. We study the performance using our pre-trained model as backbones for diverse downstream tasks, including category classification, attribute extraction, product matching, product clustering, and adult product recognition. Experimental results show that our proposed method outperforms the baseline in each downstream task regarding both single modality and multiple modalities.
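To make the idea of contrastively aligning image and text representations concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration only: the abstract does not specify the exact objective used by e-CLIP, and the function names (`l2_normalize`, `cross_entropy_diag`, `clip_style_loss`) and the temperature value of 0.07 are assumptions made for this example. Each product's image embedding is treated as the positive match for its own text embedding, and all other products in the batch act as negatives.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th image should match the i-th text (and vice versa).
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over a batch of
    # paired (image, text) embeddings. The temperature is an assumed value.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to get the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch of two products with 2-dimensional embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_style_loss(images, matched_texts)
misaligned_loss = clip_style_loss(images, swapped_texts)
```

In this toy batch, the loss is close to zero when each image embedding points in the same direction as its own text embedding, and it is much larger when the pairs are swapped. Minimizing such a loss over unlabeled product image and text pairs is what pulls the two encoders into a shared embedding space.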
Taxonomy
Methods: Contrastive Learning
