Visual Zero-Shot E-Commerce Product Attribute Value Extraction
Jiaying Gong, Ming Cheng, Hongda Shen, Pierre-Yves Vandenbussche, Janet Jenq, Hoda Eldardiry

TL;DR
This paper introduces ViOC-AG, a cross-modal zero-shot framework that extracts product attribute values from images alone, reducing seller effort and outperforming existing models in e-Commerce applications.
Contribution
The paper presents a novel CLIP-based zero-shot attribute value extraction method that requires only images, with a task-specific text decoder and OCR/LLM corrections, avoiding manual descriptions.
Findings
ViOC-AG outperforms fine-tuned vision-language models in zero-shot extraction accuracy.
The framework effectively integrates OCR tokens and LLM outputs for improved attribute value correction.
It reduces the need for manual product descriptions, streamlining e-Commerce workflows.
Abstract
Existing zero-shot product attribute value (aspect) extraction approaches in the e-Commerce industry rely on uni-modal or multi-modal models, where sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually typing product descriptions is time-consuming and frustrating for sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which requires only product images as input. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM…
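The train-on-text, infer-on-image idea in the abstract can be illustrated with a toy sketch. This is not the ViOC-AG implementation: the embeddings are made-up three-dimensional vectors standing in for outputs of frozen CLIP encoders, and the hypothetical `ToyDecoder` is a simple nearest-neighbour lookup swapped in for the paper's trained generative text decoder. It only shows why a decoder fitted on text-side embeddings can be applied to an image-side embedding when both live in a shared space.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


class ToyDecoder:
    """Hypothetical stand-in for a text decoder: returns the attribute value
    whose stored text embedding is closest to the query embedding."""

    def __init__(self):
        self.memory = []  # list of (unit-length embedding, attribute value)

    def train_on_text(self, text_embeddings, labels):
        # Text-only training: only text-side embeddings are ever seen here.
        self.memory = [(normalize(e), y) for e, y in zip(text_embeddings, labels)]

    def generate(self, embedding):
        # Inference accepts any embedding from the shared space,
        # including one produced by an image encoder.
        return max(self.memory, key=lambda m: cosine(m[0], embedding))[1]


# Assumed (made-up) embeddings, as if from a frozen text encoder.
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
labels = ["color: red", "color: blue", "material: leather"]

decoder = ToyDecoder()
decoder.train_on_text(text_embeddings, labels)

# Assumed (made-up) embedding, as if from a frozen image encoder.
image_embedding = [0.8, 0.2, 0.1]
prediction = decoder.generate(image_embedding)
print(prediction)
```

In a real system the image and text embeddings do not coincide exactly, which is the modality gap the abstract says the task-customized decoder is meant to alleviate; the toy lookup above ignores that gap.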
Taxonomy
Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
Methods: Contrastive Language-Image Pre-training
