Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists

Govind Krishnan Gangadhar; Ashish Kulkarni

arXiv:2201.02896·cs.IR·January 11, 2022

TL;DR

This paper introduces a novel method for extracting product specifications from diverse HTML structures on e-commerce websites, improving recall and scalability over previous table- and list-focused approaches.

Contribution

The paper presents a generalized extraction approach that uses hand-coded and deep-learned features to identify and extract specifications from various HTML elements.

Findings

01. Outperforms existing models in accuracy and recall

02. Successfully extracted data from 14,111 diverse blocks

03. Demonstrates scalability for large-scale web data extraction

Abstract

E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications like product catalogue curation, search, question answering, and others. However, across different websites, a wide variety of HTML elements (like <table>, <ul>, <div>, <span>, <dl> etc.) is used to render these blocks, which makes their automatic extraction a challenge. Most current research has focused on extracting product specifications from tables and lists and, therefore, suffers from low recall when applied to a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables or lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded…
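
To make the recall problem in the abstract concrete, here is a minimal sketch, not the authors' method. It assumes a toy page in which the same kind of attribute-value data is rendered once as a <table>, once as a <dl>, and once as a <ul>. The names SpecCellParser and extract_pairs, the sample markup, and the "Key: Value" convention for list items are all hypothetical. The sketch uses only tag names and no learned features, and it does not attempt <div> or <span> blocks, which tag names alone cannot identify.

```python
from html.parser import HTMLParser


class SpecCellParser(HTMLParser):
    """Collects the text of cells whose tag is in `allowed_tags` (hypothetical helper)."""

    def __init__(self, allowed_tags):
        super().__init__()
        self.allowed_tags = allowed_tags
        self.current = None   # tag of the cell currently being captured
        self.buffer = []      # text fragments of that cell
        self.cells = []       # (tag, text) in document order

    def handle_starttag(self, tag, attrs):
        # Start capturing only when not already inside a captured cell.
        if tag in self.allowed_tags and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            text = " ".join("".join(self.buffer).split())  # normalize whitespace
            self.cells.append((tag, text))
            self.current = None


def extract_pairs(html, allowed_tags):
    """Naive pairing (an assumption): <li> cells hold 'Key: Value';
    all other cells alternate key, value in document order."""
    parser = SpecCellParser(allowed_tags)
    parser.feed(html)
    parser.close()
    pairs = {}
    pending_key = None
    for tag, text in parser.cells:
        if tag == "li":
            key, sep, value = text.partition(":")
            if sep:
                pairs[key.strip()] = value.strip()
        elif pending_key is None:
            pending_key = text
        else:
            pairs[pending_key] = text
            pending_key = None
    return pairs


# Hypothetical sample page: three specification blocks, three different renderings.
PAGE = (
    "<table><tr><td>Weight</td><td>1.2 kg</td></tr></table>"
    "<dl><dt>Color</dt><dd>Black</dd></dl>"
    "<ul><li>Battery: 10 hours</li></ul>"
)

table_only = extract_pairs(PAGE, {"td", "th"})
generalized = extract_pairs(PAGE, {"td", "th", "dt", "dd", "li"})

print(table_only)
print(generalized)
```

On this toy page the table-only configuration recovers one of the three pairs, while the wider tag set recovers all three. Real pages also carry navigation lists and layout tables, so a tag whitelist like this would admit many false positives; it only illustrates why restricting extraction to tables limits recall.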
