Extracting Syntactic Patterns from Databases
Andrew Ilyas, Joana M. F. da Trindade, Raul Castro Fernandez, Samuel Madden

TL;DR
This paper introduces XSystem, a fast and efficient method for learning regular expression patterns from database columns, enabling applications like outlier detection, column similarity measurement, and semantic labeling.
Contribution
We develop XSystem, a novel approach that significantly reduces the time to learn regular expressions from database data, enhancing practical data analysis tasks.
Findings
XSystem learns patterns much faster than existing regular expression learners, which take hundreds of seconds on columns of just a few thousand values.
Patterns captured by XSystem effectively identify outliers and similar columns.
The approach supports semantic labeling of database fields.
Abstract
Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One way to express such patterns would be to learn regular expressions for each field in the database. Unfortunately, existing techniques on regular expression learning are slow, taking hundreds of seconds for columns of just a few thousand values. In contrast, we develop XSystem, an efficient method to learn patterns over database columns in significantly less time. We show that these patterns can not only be built quickly, but are expressive enough to capture a number of key applications,…
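To make the idea of a column pattern concrete, here is a minimal Python sketch of learning a regular expression from column values and using it to flag outliers. It is not XSystem's algorithm, which the abstract excerpt does not describe. It is a deliberately naive stand-in that maps each value to a coarse "shape" (digits, ASCII letters, literal punctuation) and takes the most frequent shape as the column's pattern. The function names `shape_of`, `learn_pattern`, and `find_outliers` and the sample values are hypothetical.

```python
import re
from collections import Counter


def shape_of(value: str) -> str:
    """Generalize one value into a regex: digits -> \\d, ASCII letters -> [A-Za-z],
    anything else -> an escaped literal. Runs of equal tokens are collapsed."""
    tokens = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            tokens.append(r"\d")
        elif ch.isascii() and ch.isalpha():
            tokens.append("[A-Za-z]")
        else:
            tokens.append(re.escape(ch))

    pieces = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] == tokens[i]:
            j += 1
        run = j - i
        # Emit "\d{3}" for a run of three digits, or the bare token for a run of one.
        pieces.append(tokens[i] if run == 1 else f"{tokens[i]}{{{run}}}")
        i = j
    return "".join(pieces)


def learn_pattern(values: list[str]) -> str:
    """Return the most common shape in the column (naive assumption: one dominant format)."""
    if not values:
        raise ValueError("cannot learn a pattern from an empty column")
    return Counter(shape_of(v) for v in values).most_common(1)[0][0]


def find_outliers(values: list[str], pattern: str) -> list[str]:
    """Return the values that do not fully match the learned pattern."""
    compiled = re.compile(pattern)
    return [v for v in values if not compiled.fullmatch(v)]


# Hypothetical column of phone-like values with one placeholder entry.
column = ["555-1234", "555-9876", "444-0000", "N/A"]
pattern = learn_pattern(column)
print(pattern)
print(find_outliers(column, pattern))
```

A single dominant shape is far less expressive than the patterns the paper describes, since it cannot represent variable-length fields or alternative formats within one column. It only illustrates how a learned pattern supports the outlier detection use case.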
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and Dialogue Systems
