Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren

arXiv:1004.2372 · cs.DB · April 15, 2010

Open Access

TL;DR

This paper introduces a probabilistic algorithm that learns deterministic regular expressions in which each symbol occurs only a few times (k-OREs) from XML data, in support of DTD and XSD inference. The method is validated on real and synthetic datasets.

Contribution

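The paper presents a novel probabilistic algorithm for learning k-occurrence regular expressions (k-OREs). It improves schema inference from XML data by targeting the subclass of deterministic expressions that occurs in practical DTDs and XSDs, rather than the full class, which cannot be learned from positive examples alone.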

Findings

01 Effective on real-world XML datasets

02 Outperforms previous simpler models

03 Conservatively extends prior methods

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic…
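The two properties named in the abstract, determinism and the k-occurrence bound, can be made concrete with a small sketch. The Python below is illustrative only and is not the paper's learning algorithm. It assumes the usual Glushkov-position reading of determinism (no first or follow set contains two distinct positions carrying the same symbol), which the abstract itself does not spell out. The tuple encoding of expressions and the helper names `glushkov`, `is_deterministic`, and `occurrence_bound` are hypothetical and introduced here for illustration.

```python
from collections import Counter
from itertools import count

# Assumed encoding (not from the paper): expressions are nested tuples
#   ("sym", "a"), ("cat", r, s), ("alt", r, s), ("star", r), ("plus", r), ("opt", r)

def glushkov(node, counter, symbol_of, follow):
    """Return (nullable, first, last) for node; fill symbol_of and follow."""
    kind = node[0]
    if kind == "sym":
        pos = next(counter)          # every symbol occurrence gets its own position
        symbol_of[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("star", "plus", "opt"):
        nullable, first, last = glushkov(node[1], counter, symbol_of, follow)
        if kind != "opt":            # repetition: last positions loop back to first
            for pos in last:
                follow[pos] |= first
        return nullable or kind != "plus", first, last
    if kind not in ("cat", "alt"):
        raise ValueError(f"unknown node kind: {kind!r}")
    n1, f1, l1 = glushkov(node[1], counter, symbol_of, follow)
    n2, f2, l2 = glushkov(node[2], counter, symbol_of, follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for pos in l1:                   # concatenation: left's last feeds right's first
        follow[pos] |= f2
    return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())

def is_deterministic(regex):
    """True if no first/follow set holds two positions with the same symbol."""
    symbol_of, follow = {}, {}
    _, first, _ = glushkov(regex, count(), symbol_of, follow)
    for positions in [first, *follow.values()]:
        symbols = [symbol_of[p] for p in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True

def occurrence_bound(regex):
    """Smallest k such that every symbol occurs at most k times in regex."""
    symbol_of, follow = {}, {}
    glushkov(regex, count(), symbol_of, follow)
    return max(Counter(symbol_of.values()).values())

A, B = ("sym", "a"), ("sym", "b")
NONDET = ("cat", ("star", ("alt", A, B)), A)   # (a|b)*a
BLOCK = ("cat", ("star", B), A)                # b*a
DET = ("cat", BLOCK, ("star", BLOCK))          # b*a(b*a)*

print(is_deterministic(NONDET), occurrence_bound(NONDET))
print(is_deterministic(DET), occurrence_bound(DET))
```

Here `(a|b)*a` is a 2-occurrence expression but fails the determinism check, because both occurrences of `a` are possible first positions. The expression `b*a(b*a)*` accepts the same words, also uses each symbol at most twice, and passes the check, so under these assumptions it is a deterministic 2-ORE.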

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Machine Learning and Algorithms