XTreePath: A generalization of XPath to handle real world structural variation
Joseph Paul Cohen, Wei Ding, Abraham Bagherjeiran

TL;DR
XTreePath enhances information extraction by capturing contextual DOM node information and using recursive tree matching, significantly improving robustness against HTML template changes compared to traditional XPath methods.
Contribution
It introduces XTreePath, a novel annotation and matching approach that handles real-world structural variations in web data extraction.
Findings
XTreePath outperforms XPath in extraction success rate.
The method is effective across diverse websites and domains.
It demonstrates robustness to HTML template changes.
Abstract
We discuss a key problem in information extraction: wrapper failures caused by changing content templates. A good proportion of wrapper failures occur because HTML templates change, leaving wrappers incompatible after elements are added to or removed from the DOM (the tree representation of HTML). We perform a large-scale empirical analysis of the causes of shift and mathematically quantify the levels of domain difficulty based on entropy. We propose the XTreePath annotation method to capture contextual node information from the training DOM. We then utilize this annotation in a supervised manner at test time with our proposed Recursive Tree Matching method, which recursively locates the nodes most similar in context using the tree edit distance. The search is based on a heuristic function that takes into account the similarity of a tree compared to the structure that was present…
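The failure mode described in the abstract can be illustrated with a small sketch. This is not the paper's Recursive Tree Matching: the two pages, the path constants, and the helper names are hypothetical, and the similarity score uses `difflib.SequenceMatcher` over tag sequences as a stand-in for a tree edit distance. It only shows why a positional path breaks when an element is inserted, while a lookup keyed on the surrounding subtree can still find the node.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical training page and a later version of the same template,
# where an extra <div> was inserted ahead of the content block.
TRAIN = ET.fromstring(
    "<html><body><div><h1>Item</h1><span>Price: 10</span></div></body></html>"
)
TEST = ET.fromstring(
    "<html><body><div><p>Ad</p></div>"
    "<div><h1>Other</h1><span>Price: 12</span></div></body></html>"
)

ABS_PATH = "body/div[1]/span"   # positional path, relative to <html>
CONTEXT_PATH = "body/div[1]"    # subtree around the target in the training page
TARGET_IN_CONTEXT = "span"      # path from the context root to the target


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT a tree edit distance."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def extract_by_context(train_root, test_root):
    """Find the test subtree most like the training context, then the target."""
    context = train_root.find(CONTEXT_PATH)
    best = max(test_root.iter(), key=lambda node: similarity(context, node))
    target = best.find(TARGET_IN_CONTEXT)
    return None if target is None else target.text


# The positional path now points at the inserted block, which has no <span>.
positional = TEST.find(ABS_PATH)
# The context-based lookup selects the subtree shaped like the training one.
contextual = extract_by_context(TRAIN, TEST)
```

In this toy case the positional lookup returns nothing on the changed page, whereas the context-based lookup still reaches the price node. A real implementation would need a proper tree distance and a strategy for ties between similar subtrees.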
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Malware Detection Techniques · Software Engineering Research
