PAM: Understanding Product Images in Cross Product Category Attribute Extraction
Rongmei Lin, Xiang He, Jie Feng, Nasser Zalmout, Yan Liang, Li Xiong, Xin Luna Dong

TL;DR
This paper introduces a transformer-based multimodal framework for extracting product attributes from images, text, and OCR tokens, improving accuracy across multiple categories in e-commerce.
Contribution
It presents a unified, multimodal attribute extraction model that leverages visual and textual cues, conditioned on product category, outperforming text-only methods.
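To make the idea of fusing modalities concrete, here is a minimal Python sketch of how the three input streams (product text, OCR tokens, detected visual objects) might be merged into one tagged sequence, and how the product category could narrow the candidate attribute values. This is an illustration under stated assumptions, not the authors' implementation: the names `Token`, `build_multimodal_sequence`, `candidate_values`, and the `category_vocab` dictionary are hypothetical, and the transformer that would consume the sequence is omitted entirely.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    """One input element tagged with the modality it came from."""
    modality: str  # "text", "ocr", or "visual"
    value: str


def build_multimodal_sequence(
    text_tokens: List[str],
    ocr_tokens: List[str],
    visual_labels: List[str],
) -> List[Token]:
    """Concatenate the three streams into a single tagged sequence.

    A real model would embed each token; here we only keep the tags.
    """
    sequence = [Token("text", t) for t in text_tokens]
    sequence += [Token("ocr", t) for t in ocr_tokens]
    sequence += [Token("visual", v) for v in visual_labels]
    return sequence


def candidate_values(
    category: str,
    category_vocab: Dict[str, List[str]],
    sequence: List[Token],
) -> List[str]:
    """Collect candidate attribute values for one product.

    Assumption: candidates are a category-specific vocabulary plus
    words copied from the text and OCR streams (duplicates dropped).
    """
    candidates = list(category_vocab.get(category, []))
    for tok in sequence:
        if tok.modality in ("text", "ocr") and tok.value not in candidates:
            candidates.append(tok.value)
    return candidates


seq = build_multimodal_sequence(["herbal", "shampoo"], ["500ml"], ["bottle"])
values = candidate_values("shampoo", {"shampoo": ["liquid", "500ml"]}, seq)
```

In this toy setup, the visual label `"bottle"` is kept in the sequence as context but is not itself offered as a candidate value, while words from the text and OCR streams are; that split is a design choice of the sketch, not a claim about the paper.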
Findings
15% gain in recall over text-only methods
10% improvement in F1 score
Effective across 14 product categories
Abstract
Understanding product attributes plays an important role in improving the online shopping experience for customers and serves as an integral part of constructing a product knowledge graph. Most existing methods focus on attribute extraction from text descriptions or utilize visual information from product images, such as shape and color. Compared to the inputs considered in prior works, a product image in fact contains more information, represented by a rich mixture of words and visual clues with a layout carefully designed to impress customers. This work proposes a more inclusive framework that fully utilizes these different modalities for attribute extraction. Inspired by recent works in visual question answering, we use a transformer-based sequence-to-sequence model to fuse representations of product text, Optical Character Recognition (OCR) tokens, and visual objects detected in the…
