DescribeX: A Framework for Exploring and Querying XML Web Collections
Flavio Rizzolo

TL;DR
DescribeX is a flexible framework for creating detailed XML summaries of web collections, enabling efficient XPath query evaluation and better understanding of complex, heterogeneous XML data at scale.
Contribution
DescribeX introduces a declarative approach to XML summarization based on axis path regular expressions (AxPREs), improving both XPath query performance and the understanding of web XML collections.
Findings
Scalable summary refinement for multi-gigabyte collections
Order-of-magnitude faster XPath query evaluation using DescribeX summaries
Competitive performance with traditional XML query engines
Abstract
This thesis introduces DescribeX, a framework capable of describing arbitrarily complex XML summaries of web collections and of supporting more efficient evaluation of XPath workloads. DescribeX permits the declarative description of document structure using all axes and language constructs in XPath, and generalizes many of the XML indexing and summarization approaches in the literature. DescribeX supports the construction of heterogeneous summaries in which different document elements sharing a common structure can be declaratively defined and refined by means of path regular expressions on axes, or axis path regular expressions (AxPREs). DescribeX can significantly help in understanding both the structure of complex, heterogeneous XML collections and the behaviour of XPath queries evaluated on them. Experimental results demonstrate the scalability of…
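To make the idea of a structural summary concrete, the following is a minimal sketch of one of the simplest kinds of summary: elements are grouped by the sequence of labels along the parent axis up to the root. It is not the DescribeX implementation and does not reproduce AxPRE syntax; the function name `label_path_summary` and the sample document are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def label_path_summary(xml_text):
    """Group elements by their root-to-element label path.

    Hypothetical helper for illustration only: each distinct label path
    acts as one summary node, and the list of elements sharing that path
    is the node's extent.
    """
    root = ET.fromstring(xml_text)
    extents = defaultdict(list)
    # Iterative traversal; each stack entry pairs an element with its label path.
    stack = [(root, "/" + root.tag)]
    while stack:
        elem, path = stack.pop()
        extents[path].append(elem)
        for child in elem:
            stack.append((child, path + "/" + child.tag))
    return dict(extents)


# Small made-up document with heterogeneous structure.
doc = """<bib>
  <book><title>A</title><author>X</author></book>
  <book><title>B</title></book>
  <article><title>C</title></article>
</bib>"""

summary = label_path_summary(doc)
sizes = {path: len(nodes) for path, nodes in summary.items()}

# A simple path query can be answered from the summary alone:
# the extent of "/bib/book/title" holds exactly the matching elements.
book_titles = sorted(e.text for e in summary["/bib/book/title"])
```

In this sketch a path query such as `/bib/book/title` is answered by a dictionary lookup instead of a document traversal, which is the intuition behind evaluating XPath over a summary. Refining a summary, in the sense the abstract describes, would correspond to splitting an extent further by additional structural criteria; that step is not shown here.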
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Semantic Web and Ontologies
