XTreePath: A generalization of XPath to handle real world structural variation

Joseph Paul Cohen; Wei Ding; Abraham Bagherjeiran

arXiv:1505.01303·cs.IR·December 29, 2017·1 cites

TL;DR

XTreePath enhances information extraction by capturing contextual DOM node information and using recursive tree matching, significantly improving robustness against HTML template changes compared to traditional XPath methods.

Contribution

It introduces XTreePath, a novel annotation and matching approach that handles real-world structural variations in web data extraction.

Findings

01 XTreePath outperforms XPath in extraction success rate.

02 The method is effective across diverse websites and domains.

03 It demonstrates robustness to HTML template changes.
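
To make the template-change problem concrete, here is a minimal sketch (not taken from the paper) of a positional XPath-style wrapper breaking when a template gains one extra element. The two page snippets and the path are invented for illustration, and Python's standard-library ElementTree, which supports only a limited XPath subset, stands in for a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: NEW_PAGE is OLD_PAGE with one extra <div> inserted
# at the top of <body>, as might happen after a template redesign.
OLD_PAGE = (
    "<html><body>"
    "<div><h1>Title</h1></div>"
    "<div><span>$10</span></div>"
    "</body></html>"
)
NEW_PAGE = (
    "<html><body>"
    "<div>Ad</div>"
    "<div><h1>Title</h1></div>"
    "<div><span>$10</span></div>"
    "</body></html>"
)

# A positional wrapper written against OLD_PAGE: "the span inside the
# second div under body".
PRICE_PATH = "./body/div[2]/span"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


old_result = extract(OLD_PAGE, PRICE_PATH)  # the wrapper finds the price
new_result = extract(NEW_PAGE, PRICE_PATH)  # div[2] now holds the <h1>, so no span matches
```

The price is still present in the new page, but the fixed path no longer points at it. This is the kind of wrapper failure the findings above refer to.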

Abstract

We discuss a key problem in information extraction: wrapper failures caused by changing content templates. A good proportion of wrapper failures occur because HTML templates change, leaving wrappers incompatible after elements are added to or removed from the DOM (the tree representation of HTML). We perform large-scale empirical analyses of the causes of shift and mathematically quantify the levels of domain difficulty based on entropy. We propose the XTreePath annotation method to capture contextual node information from the training DOM. We then utilize this annotation in a supervised manner at test time with our proposed Recursive Tree Matching method, which recursively locates the nodes most similar in context using the tree edit distance. The search is based on a heuristic function that takes into account the similarity of a tree compared to the structure that was present…
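
The idea of locating the node whose context is closest to an annotated training subtree can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' Recursive Tree Matching: the abstract does not give the heuristic, so the sketch exhaustively scores every subtree, and it uses a simplified top-down ordered tree edit distance in which whole child subtrees are inserted or deleted. The tuple-based tree encoding, the helper names (`node`, `size`, `dist`, `subtrees`, `best_match`), and the example trees are all hypothetical.

```python
def node(tag, *children):
    """Hypothetical tree encoding: a (tag, children) tuple."""
    return (tag, children)


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def dist(a, b):
    """Simplified top-down ordered tree edit distance (an assumption,
    not necessarily the distance used in the paper).

    Relabeling a root costs 1; inserting or deleting a child costs the
    size of that child's whole subtree; children are aligned in order
    with a standard sequence edit-distance table.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = table[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        table[0][j] = table[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = min(
                table[i - 1][j] + size(kids_a[i - 1]),                # delete child
                table[i][j - 1] + size(kids_b[j - 1]),                # insert child
                table[i - 1][j - 1] + dist(kids_a[i - 1], kids_b[j - 1]),  # match children
            )
    return relabel + table[m][n]


def subtrees(tree):
    """Yield the tree and every subtree below it, in document order."""
    yield tree
    for child in tree[1]:
        yield from subtrees(child)


def best_match(annotated, page):
    """Return the subtree of page closest to the annotated subtree."""
    return min(subtrees(page), key=lambda candidate: dist(annotated, candidate))


# Annotated context from a (hypothetical) training page.
ANNOTATED = node("div", node("h2"), node("span"))

# Test page where the template gained an extra <em> inside the target div.
TARGET = node("div", node("h2"), node("em"), node("span"))
PAGE = node("html", node("body", node("div", node("p")), TARGET))

match = best_match(ANNOTATED, PAGE)
```

In this toy case the changed div is still selected, because one inserted element costs less than any alternative alignment. A real implementation would need a full tree edit distance and a search strategy that avoids scoring every subtree.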

Code & Models

Repositories

ieee8023/XTreePath

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Malware Detection Techniques · Software Engineering Research