TL;DR
This paper systematically evaluates 190 open-source vision-language models for grocery product retrieval, highlighting data quality, model efficiency, and ranking challenges in zero-shot settings.
Contribution
It provides the first comprehensive zero-shot benchmark of open-source VLMs on grocery product retrieval, isolating the effects of pre-training data, architecture, and input resolution.
Findings
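Data quality matters more than scale: switching from raw web scrapes to filtered pre-training data yields up to 16.6% accuracy gains, more than doubling model parameters does.
Efficient models can win: MobileCLIP-B (150M parameters) outperforms 351M-parameter models trained on noisy data.
A precision gap persists: ranking accuracy at the top retrieval positions still falls significantly short.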
Abstract
Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. \textbf{(1) Data quality trumps scale.} Switching from raw web-scrapes to filtered datasets delivers up to 16.6\% accuracy gains, exceeding the benefit of doubling model parameters. \textbf{(2) Efficient models can win.} MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce \textit{semantic power density}, an efficiency metric that penalizes sub-threshold accuracy. \textbf{(3) A precision gap persists.}…
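To make the notion of top-rank retrieval accuracy concrete, here is a minimal sketch of a recall@k computation over embedding vectors. It is an illustration under stated assumptions, not the paper's evaluation protocol: the function name `recall_at_k`, the toy embeddings, and the choice of cosine similarity are all hypothetical, and in practice the embeddings would come from a VLM's encoders.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1):
    """Fraction of queries whose top-k retrieved gallery items include the correct SKU.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    # A query counts as a hit if any of its top-k items carries the query's label.
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (assumed, 2-D for readability): three gallery SKUs, three queries.
gallery_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gallery_labels = np.array([0, 1, 2])
query_emb = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]])
query_labels = np.array([0, 1, 2])

r1 = recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1)
r2 = recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

In this toy setup the third query ranks a visually similar but wrong SKU first and the correct one second, so recall@2 is higher than recall@1. That is the kind of top-rank gap the third finding refers to.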
