Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi

TL;DR
Through a large-scale experimental study, this paper demonstrates how to adapt general vision-language models to e-commerce data, improving product understanding and attribute extraction while maintaining their broad multimodal capabilities.
Contribution
It introduces a targeted adaptation strategy for VLMs tailored to e-commerce, along with a comprehensive evaluation suite for deep product understanding.
Findings
Significant performance improvements in e-commerce tasks
Preservation of general multimodal capabilities
Effective attribute extraction and instruction following
Abstract
E-commerce product understanding demands, by nature, strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
Taxonomy
Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
