Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli; Vladimir Orshulevich; Tala Bazazo; Christian Herold; Michael Kozielski; Marcin Mazur; Szymon Tuzel; Cees G. M. Snoek; Seyyed Hadi Hashemi; Omar Javed; Yannick Versley; Shahram Khadivi

arXiv:2602.11733 · cs.CV · February 13, 2026

TL;DR

Through a large-scale experimental study, this paper demonstrates how to adapt general vision-language models to e-commerce data, improving product understanding and attribute extraction while maintaining their broad multimodal capabilities.

Contribution

It introduces a targeted adaptation strategy for VLMs tailored to e-commerce, along with a comprehensive evaluation suite for deep product understanding.

Findings

01 Significant performance improvements in e-commerce tasks

02 Preservation of general multimodal capabilities

03 Effective attribute extraction and instruction following

Abstract

E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Taxonomy

Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining