Infer XPath
Michał J. Gajda, Hai Nguyen Quang, Do Ngoc Khanh, and Vuong Hai, Thanh

TL;DR
This paper introduces InferXPath, a method to automatically discover XPath expressions by analyzing web page data structures, aiming to automate and accelerate the conversion of web content into structured tabular data.
Contribution
It presents an approach to automatically inferring XPath expressions from web pages, extending web page analysis to the discovery of relations between sets of document nodes.
Findings
Automates the discovery of XPath expressions
Speeds up the conversion of web pages into structured tabular data
Reduces the manual effort involved in web data extraction
Abstract
We propose a reformulation of the discovery of data structure within a web page as relations between sets of document nodes. We start by reformulating web page analysis as finding expressions in an extension of XPath. Then we propose to discover these XPath expressions automatically with the InferXPath meta-language. Our goal is to automate the laborious conversion of manually created web pages that serve as software documentation, wikis, and reference documents, and to speed up their conversion into tabular data that can be fed directly into a data pipeline.
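To make the target of the inference concrete, the following is a minimal sketch of the manual step the abstract describes wanting to automate: hand-written XPath expressions that select two sets of document nodes, which are then related by document order to form table rows. It is not the InferXPath meta-language or the authors' algorithm. The page fragment, the `extract_rows` helper, and both path expressions are hypothetical, and the sketch uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed documentation fragment (not taken from the paper).
PAGE = """
<html><body>
  <dl>
    <dt>--verbose</dt><dd>Print extra diagnostics.</dd>
    <dt>--output</dt><dd>Write results to a file.</dd>
  </dl>
</body></html>
"""

def extract_rows(page, key_path, value_path):
    """Pair two node sets, selected by hand-written paths, into table rows.

    Assumption: the i-th key node corresponds to the i-th value node in
    document order. Real pages often violate this, which is part of why
    writing such expressions by hand is laborious.
    """
    root = ET.fromstring(page.strip())
    keys = root.findall(key_path)      # first set of document nodes
    values = root.findall(value_path)  # second set of document nodes
    return [(k.text, v.text) for k, v in zip(keys, values)]

# Hand-written expressions of the kind one would want inferred automatically.
ROWS = extract_rows(PAGE, ".//dl/dt", ".//dl/dd")

for option, description in ROWS:
    print(f"{option}\t{description}")
```

Here each row is an (option, description) pair, the sort of tabular output a data pipeline can consume. Pairing by position is the simplest possible relation between the two node sets and is chosen only for illustration.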
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Database Systems and Queries · Scientific Computing and Data Management
