Learning from Uncurated Regular Expressions

Michael J. Mior

arXiv:2206.06747 · cs.DB · March 18, 2024

TL;DR

This paper proposes a novel method for learning from uncurated, publicly available regular expressions to improve feature extraction and classification tasks, offering a scalable alternative to manual or data-driven approaches.

Contribution

It introduces a new approach that leverages uncurated regexes from public repositories for feature extraction, reducing overhead and broadening applicability.

Findings

01 Model trained on uncurated regexes performs competitively on semantic classification.

02 Feature extraction code is significantly smaller than existing methods.

03 Approach enables unsupervised learning using uncurated regex data.

Abstract

Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, significant time is required to learn these expressions and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. The alternative of manually writing regular expressions becomes unattractive when faced with a large number of values that must be matched. As an alternative, we propose learning from a large corpus of manually authored, but uncurated regular expressions mined from a public repository. The advantage of this approach is that we are able to extract salient features from a set of strings with limited overhead to feature engineering. Since the set of regular expressions covers a wide range of application domains, we expect them to be widely applicable. To demonstrate…
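
The idea in the abstract of using a regex corpus as a source of features can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `PATTERNS` list is a hypothetical stand-in for a mined corpus, and the choice of feature (the fraction of values each regex fully matches) is an assumption made for this example.

```python
import re

# Hypothetical stand-ins for regexes mined from a public repository.
# A real corpus would be far larger and, being uncurated, may contain
# patterns that do not even compile (like the last one here).
PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",          # ISO-like date
    r"[^@\s]+@[^@\s]+\.[a-z]+",    # rough email shape
    r"\d{5}(-\d{4})?",             # US ZIP code shape
    r"[A-Z][a-z]+",                # capitalized word
    r"(",                          # invalid pattern, skipped below
]


def compile_patterns(patterns):
    """Compile each pattern, silently dropping any that are invalid."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue
    return compiled


def featurize(values, compiled):
    """Return one feature per regex: the fraction of values it fully matches."""
    if not values:
        return [0.0] * len(compiled)
    return [
        sum(1 for v in values if rx.fullmatch(v)) / len(values)
        for rx in compiled
    ]


compiled = compile_patterns(PATTERNS)
date_features = featurize(["2024-03-18", "2022-06-14"], compiled)
mixed_features = featurize(["a@b.com", "12345"], compiled)
```

Each column of strings becomes a fixed-length numeric vector that a downstream classifier could consume, with no hand-written features beyond the regexes themselves.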

Code & Models

Repositories

dataunitylab/semantic-regex (tf, Official)

Taxonomy

Topics: Machine Learning and Data Classification · Algorithms and Data Compression · Machine Learning and Algorithms