Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Geert Jan Bex, Wouter Gelade, Frank Neven, Stijn Vansummeren

TL;DR
This paper introduces a probabilistic algorithm that learns deterministic regular expressions in which each alphabet symbol occurs at most k times (k-OREs) from XML data, supporting DTD and XSD inference. The method is validated on real and synthetic datasets.
Contribution
It presents a novel probabilistic algorithm for learning k-occurrence regular expressions (k-OREs), improving schema inference from XML data by targeting the subclass of deterministic regular expressions that actually occurs in practical DTDs and XSDs.
Findings
Effective on real-world XML datasets
Outperforms previous simpler models
Conservatively extends prior methods
Abstract
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic…
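To make the occurrence restriction concrete, the following minimal sketch computes the smallest k for which a content model respects the "each symbol occurs at most k times" limit. It is an illustration only, not the paper's learning algorithm: it does not test determinism, and the helper names (symbol_counts, occurrence_bound, satisfies_occurrence_limit) as well as the DTD-style syntax with identifier-like element names are assumptions made for this example.

```python
import re
from collections import Counter


def symbol_counts(expr: str) -> Counter:
    """Count how often each alphabet symbol (element name) occurs in expr.

    Assumption: element names are identifier-like tokens; everything else
    (parentheses, '|', ',', '*', '+', '?') is treated as an operator.
    """
    return Counter(re.findall(r"[A-Za-z_][\w\-]*", expr))


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that every symbol occurs at most k times."""
    counts = symbol_counts(expr)
    return max(counts.values(), default=0)


def satisfies_occurrence_limit(expr: str, k: int) -> bool:
    """Check only the occurrence limit of a k-ORE, not determinism."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    # Every element name occurs once, so the occurrence limit holds for k = 1.
    print(occurrence_bound("title, author+, (year | date)?"))
    # The symbol 'a' occurs twice, so k must be at least 2.
    print(occurrence_bound("a, (b | c)*, a"))
```

Counting occurrences is the cheap half of the definition. Deciding whether an expression is also deterministic requires a separate check that this sketch deliberately leaves out.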
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Machine Learning and Algorithms
