Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists
Govind Krishnan Gangadhar, Ashish Kulkarni

TL;DR
This paper introduces a novel method for extracting product specifications from diverse HTML structures on e-commerce websites, improving recall and scalability over previous table- and list-focused approaches.
Contribution
It presents a generalized extraction approach that combines hand-coded and deep-learned features to identify and extract specifications from a variety of HTML elements.
Findings
Outperforms existing models in accuracy and recall
Successfully extracted data from 14,111 diverse blocks
Demonstrates scalability for large-scale web data extraction
Abstract
E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications like product catalogue curation, search, question answering, and others. However, across different websites, a wide variety of HTML elements (like <table>, <ul>, <div>, <span>, <dl>, etc.) is used to render these blocks, which makes their automatic extraction a challenge. Most of the current research has focused on extracting product specifications from tables and lists and, therefore, suffers from low recall when applied to a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables or lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded…
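To make the markup-diversity problem concrete, the sketch below renders the same two specifications with <table>, <ul>, <dl>, and nested <div>/<span> elements and pulls attribute-value pairs out of all four with one small heuristic. It is only an illustration and not the authors' method, which the abstract describes as relying on hand-coded and learned features. The names `SpecPairParser`, `extract_pairs`, and `SAMPLES`, the sample markup, and the pairing rules are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are ignored so the stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SpecPairParser(HTMLParser):
    """Hypothetical heuristic: an element whose text splits into two chunks is a pair."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", [])]  # (tag, text chunks not yet claimed by a pair)
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace, drop blank nodes
        if text:
            self.stack[-1][1].append(text)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self.stack) < 2 or self.stack[-1][0] != tag:
            return  # skip void or mismatched closing tags
        _, chunks = self.stack.pop()
        if len(chunks) == 2:
            # e.g. <tr><th>k</th><td>v</td></tr> or <div><span>k</span><span>v</span></div>
            self.pairs.append((chunks[0].rstrip(":"), chunks[1]))
        elif len(chunks) == 1 and ": " in chunks[0]:
            # e.g. <li>k: v</li>
            key, value = chunks[0].split(": ", 1)
            self.pairs.append((key, value))
        elif tag == "dl" and len(chunks) % 2 == 0:
            # <dt>/<dd> siblings alternate inside one <dl>
            self.pairs.extend(zip(chunks[0::2], chunks[1::2]))
        else:
            # not a pair yet: hand the text up to the parent element
            self.stack[-1][1].extend(chunks)


def extract_pairs(html):
    parser = SpecPairParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# The same specification block rendered four different ways (made-up markup).
SAMPLES = {
    "table": "<table><tr><th>Weight</th><td>1.2 kg</td></tr>"
             "<tr><th>Color</th><td>Black</td></tr></table>",
    "ul": "<ul><li>Weight: 1.2 kg</li><li>Color: Black</li></ul>",
    "dl": "<dl><dt>Weight</dt><dd>1.2 kg</dd><dt>Color</dt><dd>Black</dd></dl>",
    "div": "<div><div><span>Weight</span><span>1.2 kg</span></div>"
           "<div><span>Color</span><span>Black</span></div></div>",
}

for name, html in SAMPLES.items():
    print(name, extract_pairs(html))
```

All four samples yield the same two pairs, (Weight, 1.2 kg) and (Color, Black). The heuristic is deliberately naive: it would also pair up any unrelated element that happens to contain two text fragments, which is one reason a real system first has to decide which blocks on a page are specification blocks at all.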
