V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval

Wenhao Wang; Yifan Sun; Zongxin Yang; Yi Yang

arXiv:2207.12994 · cs.CV · July 27, 2022

TL;DR

This paper presents a top-performing ensemble approach for large-scale product retrieval in ecommerce, combining vision and vision-language models with a coarse-to-fine training strategy to improve accuracy.

Contribution

It introduces a novel ensemble method that leverages the complementarity of vision and vision-language models with a two-stage training pipeline for product retrieval.

Findings

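01. Achieved 0.7623 MAR@10, taking first place in the eBay eProduct Visual Search Challenge.

02. Demonstrated that vision models and vision-language models are complementary, so combining them in one ensemble improves retrieval.

03. Implemented a coarse-to-fine metric learning approach: training on coarse labels first, then fine-grained self-supervised training.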
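The headline number above is a recall-style retrieval metric. This page does not give the challenge's exact MAR@10 formula, so the following is only a minimal sketch under a stated assumption: that the score is the per-query recall within the top 10 retrieved items, averaged over all queries. The function names `recall_at_k` and `mean_recall_at_k` are hypothetical, not taken from the challenge toolkit or the authors' repository.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k retrieved list.

    ASSUMPTION: recall is normalized by the number of relevant items;
    the official challenge metric may normalize differently.
    """
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(all_retrieved, all_relevant, k=10):
    """Average the per-query recall@k over every query."""
    per_query = [
        recall_at_k(retrieved, relevant, k)
        for retrieved, relevant in zip(all_retrieved, all_relevant)
    ]
    return sum(per_query) / len(per_query)


# Toy data: two queries with made-up item ids.
toy_retrieved = [["a", "b", "c"], ["x", "y"]]
toy_relevant = [["a", "c", "d", "e"], ["y"]]
score = mean_recall_at_k(toy_retrieved, toy_relevant, k=10)
```

On the toy data the first query recovers 2 of 4 relevant items and the second recovers 1 of 1, so `score` is the mean of 0.5 and 1.0.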

Abstract

Product retrieval is of great importance in the ecommerce domain. This paper introduces our 1st-place solution in the eBay eProduct Visual Search Challenge (FGVC9), which features an ensemble of about 20 models drawn from vision models and vision-language models. While model ensembling is common, we show that combining vision models with vision-language models brings particular benefits from their complementarity and is a key factor in our results. Specifically, for the vision models, we use a two-stage training pipeline that first learns from the coarse labels provided in the training set and then conducts fine-grained self-supervised training, yielding a coarse-to-fine metric learning scheme. For the vision-language models, we use the textual descriptions of the training images as supervision signals for fine-tuning the image encoder (feature extractor). With these designs, our…
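The abstract does not say how the roughly 20 models are fused at retrieval time. As an illustration only, the sketch below assumes one common scheme: L2-normalize each model's image feature, concatenate them, renormalize, and rank index items by cosine similarity. The names `ensemble_embedding` and `retrieve_top_k` and the toy feature values are hypothetical and are not taken from the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def ensemble_embedding(per_model_features):
    """ASSUMED fusion rule: concatenate normalized per-model features.

    Each model contributes equally because its feature has unit length
    before concatenation; the result is renormalized for cosine scoring.
    """
    fused = []
    for feature in per_model_features:
        fused.extend(l2_normalize(feature))
    return l2_normalize(fused)


def retrieve_top_k(query, index, k=10):
    """Return the ids of the k index items most similar to the query."""
    scored = []
    for item_id, embedding in index.items():
        # Dot product of unit vectors equals cosine similarity.
        similarity = sum(q * e for q, e in zip(query, embedding))
        scored.append((similarity, item_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy example: two "models" (2-d and 3-d features) and three index items.
index_features = {
    "a": [[1.0, 0.0], [0.0, 1.0, 0.0]],
    "b": [[0.0, 1.0], [1.0, 0.0, 0.0]],
    "c": [[1.0, 0.1], [0.0, 0.9, 0.1]],
}
index = {item: ensemble_embedding(feats) for item, feats in index_features.items()}
query = ensemble_embedding([[1.0, 0.0], [0.0, 1.0, 0.0]])
ranked = retrieve_top_k(query, index, k=2)
```

In the toy data the query matches item `a` exactly and item `c` closely under both models, while `b` is orthogonal under both, so `a` and `c` are ranked first. Concatenation is only one option; averaging per-model similarity scores is another, and the page gives no indication of which the authors used.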

Code & Models

Repositories

wangwenhao0716/v2l (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Attention Is All You Need · Residual Connection · Layer Normalization · Vision Transformer · Neighborhood Attention