V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval
Wenhao Wang, Yifan Sun, Zongxin Yang, Yi Yang

TL;DR
This paper presents the first-place solution to the eBay eProduct Visual Search Challenge: an ensemble of vision and vision-language models for large-scale product retrieval in ecommerce, with the vision models trained in a coarse-to-fine manner.
Contribution
It introduces an ensemble method that exploits the complementarity of vision and vision-language models, and trains the vision models with a two-stage pipeline: supervised learning on coarse labels, followed by fine-grained self-supervised training.
Findings
Achieved 0.7623 MAR@10, placing first in the eBay eProduct Visual Search Challenge (FGVC9).
Demonstrated benefits of combining vision and vision-language models.
Implemented a coarse-to-fine metric learning approach for improved retrieval.
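The page does not define MAR@10. The sketch below assumes it is recall@10 averaged over all queries, which may differ from the challenge's official definition. The function names and the toy data are hypothetical and serve only as illustration.

```python
from typing import Dict, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of a query's relevant items that appear in its top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, Sequence[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average recall@k over all queries (assumed reading of MAR@k)."""
    scores = [recall_at_k(rankings[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)


# Toy example with two queries and k=2.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
ground_truth = {"q1": {"a", "d"}, "q2": {"y"}}
score = mean_recall_at_k(rankings, ground_truth, k=2)
```

In the toy example, q1 retrieves one of its two relevant items and q2 retrieves its only relevant item, so the averaged score is 0.75.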
Abstract
Product retrieval is of great importance in the ecommerce domain. This paper introduces our 1st-place solution in the eBay eProduct Visual Search Challenge (FGVC9), which features an ensemble of about 20 models drawn from vision models and vision-language models. While model ensembling is common, we show that combining vision models and vision-language models brings particular benefits from their complementarity and is a key factor in our superior performance. Specifically, for the vision models, we use a two-stage training pipeline that first learns from the coarse labels provided in the training set and then conducts fine-grained self-supervised training, yielding a coarse-to-fine metric learning scheme. For the vision-language models, we use the textual description of each training image as the supervision signal for fine-tuning the image encoder (feature extractor). With these designs, our…
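The truncated abstract does not say how the roughly 20 models are fused. One common option, assumed here and not confirmed by the source, is to L2-normalize each model's embedding, concatenate them, and rank gallery items by cosine similarity. The following is a minimal sketch of that idea; all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def ensemble_embedding(per_model_features: Sequence[Sequence[float]]) -> Vector:
    """Concatenate per-model unit vectors, then renormalize the result.

    Assumption: fusion by concatenation; the source does not specify it.
    """
    fused: Vector = []
    for feat in per_model_features:
        fused.extend(l2_normalize(feat))
    return l2_normalize(fused)


def top_k(query: Vector, index: Sequence[Vector], k: int = 10) -> List[int]:
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = [
        (sum(q * g for q, g in zip(query, emb)), i) for i, emb in enumerate(index)
    ]
    scores.sort(key=lambda s: -s[0])  # highest cosine similarity first
    return [i for _, i in scores[:k]]


# Toy example: two "models", each producing a 2-d feature per image.
query = ensemble_embedding([[1.0, 0.0], [1.0, 0.0]])
gallery = [
    ensemble_embedding([[0.0, 1.0], [0.0, 1.0]]),  # both models disagree with query
    ensemble_embedding([[1.0, 0.0], [1.0, 0.0]]),  # both models agree with query
    ensemble_embedding([[1.0, 0.0], [0.0, 1.0]]),  # only the first model agrees
]
ranked = top_k(query, gallery, k=2)
```

With unit-length per-model features, the cosine similarity of the concatenated vectors equals the mean of the per-model cosine similarities, so every model contributes equally regardless of its raw feature scale.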
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Attention Is All You Need · Residual Connection · Layer Normalization · Vision Transformer · Neighborhood Attention
