Discovering Restricted Regular Expressions with Interleaving
Feifei Peng, Haiming Chen

TL;DR
This paper addresses the challenge of learning minimal, unordered XML schemas with interleaving from example data, proposing new algorithms to approximate solutions for an NP-hard problem.
Contribution
The paper introduces an approximation algorithm and a heuristic for inferring minimal schemas with interleaving, a class of schemas that previous inference methods could not handle effectively.
Findings
Heuristic results are close to optimal
Algorithms work effectively on real-world datasets
Schema inference with interleaving is NP-hard
Abstract
Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce…
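To make the interleaving operator in the abstract concrete, here is a minimal Python sketch of its semantics on plain strings. It is not the paper's inference algorithm. It only illustrates what it means for a string to belong to the interleaving (shuffle) of two words, which is what lets an unordered schema accept child elements in several orders. The function names `is_interleaving` and `shuffle` are hypothetical, and each character is assumed to stand for one element name.

```python
from functools import lru_cache


def is_interleaving(w, u, v):
    """Return True if w merges u and v while keeping each one's internal order."""
    if len(w) != len(u) + len(v):
        return False

    @lru_cache(maxsize=None)
    def match(i, j):
        # i symbols of u and j symbols of v have been consumed so far.
        k = i + j
        if k == len(w):
            return True
        # Try to take the next symbol of w from u, then from v.
        if i < len(u) and u[i] == w[k] and match(i + 1, j):
            return True
        if j < len(v) and v[j] == w[k] and match(i, j + 1):
            return True
        return False

    return match(0, 0)


def shuffle(u, v):
    """Enumerate every interleaving of u and v (exponential; for tiny inputs only)."""
    if not u:
        return {v}
    if not v:
        return {u}
    take_u = {u[0] + rest for rest in shuffle(u[1:], v)}
    take_v = {v[0] + rest for rest in shuffle(u, v[1:])}
    return take_u | take_v


if __name__ == "__main__":
    # "a" interleaved with "bc": a may appear anywhere, but b must precede c.
    print(sorted(shuffle("a", "bc")))
    print(is_interleaving("bac", "a", "bc"))
    print(is_interleaving("cba", "a", "bc"))
```

In this sketch, interleaving "a" with "bc" accepts "abc", "bac", and "bca" but rejects "cba", because the relative order of b and c must be preserved. A schema learner of the kind the abstract describes has to recover such an expression from positive example strings alone.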
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Data Mining Algorithms and Applications
