Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace
Andre Rusli, Shoma Ishimoto, Sho Akiyama, Aman Kumar Singh

TL;DR
This paper demonstrates a scalable zero-shot visual search system for C2C marketplaces, showing that recent vision-language models significantly improve retrieval performance and user engagement with minimal fine-tuning.
Contribution
It introduces a real-time, scalable visual search pipeline using zero-shot models, validated by offline metrics and online A/B testing in a production environment.
Findings
The multilingual SigLIP model outperforms the other models evaluated, with a 13.3% increase in nDCG@5 over the fine-tuned baseline
A one-week online A/B test in production shows up to a 40.9% higher transaction rate
Zero-shot models are practical for production with minimal fine-tuning
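For readers unfamiliar with the offline metric, the following is a minimal sketch of how nDCG@5 can be computed for a single query. It is not the authors' evaluation code. The function names and the example relevance labels are hypothetical, and the ideal ranking is assumed to be built from the same retrieved list, which may differ from the paper's setup.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the returned order divided by DCG of the best possible order.

    Assumption: the ideal order is obtained by sorting the same list of
    relevance labels, i.e. only retrieved items are considered.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant item was retrieved for this query
    return dcg_at_k(relevances, k) / ideal


# Hypothetical labels: 1 if the user interacted with the result at that rank, else 0.
example = [0, 1, 0, 0, 1]
score = ndcg_at_k(example, k=5)
```

A model-level score would typically be the mean of this value over all evaluation queries; the summary does not state how the 13.3% increase is aggregated.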
Abstract
Visual search offers an intuitive way for customers to explore diverse product catalogs, particularly in consumer-to-consumer (C2C) marketplaces where listings are often unstructured and visually driven. This paper presents a scalable visual search system deployed in Mercari's C2C marketplace, where end-users act as buyers and sellers. We evaluate recent vision-language models for zero-shot image retrieval and compare their performance with an existing fine-tuned baseline. The system integrates real-time inference and background indexing workflows, supported by a unified embedding pipeline optimized through dimensionality reduction. Offline evaluation using user interaction logs shows that the multilingual SigLIP model outperforms other models across multiple retrieval metrics, achieving a 13.3% increase in nDCG@5 over the baseline. A one-week online A/B test in production further…
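The embedding pipeline described in the abstract can be illustrated with a small sketch. The abstract does not say which dimensionality-reduction method or search index is used, so this example substitutes PCA and brute-force cosine similarity as stand-ins, and uses random vectors in place of real image embeddings. All function names, sizes, and dimensions here are assumptions for illustration only.

```python
import numpy as np


def fit_pca(embeddings, out_dim):
    """Fit PCA on catalog embeddings; returns the mean and a (dim, out_dim) projection."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim].T


def project(embeddings, mean, components):
    """Reduce dimensionality, then L2-normalize so a dot product equals cosine similarity."""
    reduced = (embeddings - mean) @ components
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.clip(norms, 1e-12, None)


def top_k(query, index, k):
    """Brute-force nearest neighbours by cosine similarity (stand-in for a real vector index)."""
    scores = index @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Random vectors stand in for image embeddings produced by a vision-language model.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 64))

# Background indexing: fit the reduction once and store the reduced catalog vectors.
mean, components = fit_pca(catalog, out_dim=16)
index = project(catalog, mean, components)

# Real-time inference: apply the same projection to the query image embedding.
query = project(catalog[42:43], mean, components)[0]
ids, scores = top_k(query, index, k=5)
```

Sharing one projection between the indexing and query paths keeps both sets of vectors in the same reduced space, which is the property a unified embedding pipeline needs. A production system would likely replace the brute-force search with an approximate nearest-neighbour index.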
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
