gMeta: Template-based Regular Expression Generation over Noisy Examples
Shujun Wang, Yongqiang Tian, and Dengcheng He

TL;DR
gMeta is a template-based method that generates regular expressions from noisy examples by clustering the examples, filtering out anomalous ones, and translating the remaining positive examples into a regex. It is reported to be effective on real-world datasets and in industrial scenarios.
Contribution
This paper introduces a template-based approach, built on a data model (MetaParam) and dynamic threshold filtering, that synthesizes regexes from noisy data. It addresses a limitation of earlier tools, which assume that the input examples are faultless.
Findings
Achieved promising F-measure results on four real-world extraction tasks.
Performed well in real industrial scenarios.
Efficient, interpretable, and extensible regex generation method.
Abstract
Regular expressions (regexes) are widely used in different fields of computer science, such as programming languages, string processing, and databases. However, existing tools for synthesizing or repairing regexes always assume that the input examples are faultless. In real industrial scenarios, this assumption does not entirely hold. Thus, this paper presents a simple but effective template-based approach to generate regular expressions over noisy examples. Specifically, we present a data model (i.e., MetaParam) to extract features of strings for clustering all examples. Then, we propose a practical dynamic thresholding scheme to filter out anomalous examples by detecting knee points on CDF graphs. Finally, we design a template-based algorithm to translate a finite set of positive examples into a regular expression, which is efficient, interpretable, and extensible. We performed an…
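The three steps named in the abstract (feature-based clustering, knee-point filtering on a CDF, template-based translation) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define MetaParam, the knee detector, or the templates, so everything below is an assumption made for illustration. The hypothetical `signature` function stands in for feature extraction, `knee_index` uses a simple farthest-from-chord heuristic in place of the paper's knee detection, and `to_regex` uses two hard-coded templates (`\d+` and `[A-Za-z]+`).

```python
import re
import string
from collections import Counter

def signature(s):
    """Hypothetical stand-in for feature extraction: collapse runs of ASCII
    digits to 'D' and runs of ASCII letters to 'L'; keep other chars as-is."""
    out = []
    for ch in s:
        if ch in string.digits:
            tok = "D"
        elif ch in string.ascii_letters:
            tok = "L"
        else:
            tok = ch
        # Only digit and letter runs are collapsed; punctuation is kept verbatim.
        if tok in ("D", "L") and out and out[-1] == tok:
            continue
        out.append(tok)
    return "".join(out)

def knee_index(ys, eps=1e-9):
    """Assumed heuristic: index of the point farthest from the chord joining
    the first and last points; None if the curve has no visible knee."""
    n = len(ys)
    if n < 3:
        return None
    best, best_dist = None, eps
    for i, y in enumerate(ys):
        chord = ys[0] + (ys[-1] - ys[0]) * i / (n - 1)
        dist = abs(y - chord)
        if dist > best_dist:
            best, best_dist = i, dist
    return best

def filter_noise(examples):
    """Cluster by signature, build a CDF over sorted cluster sizes, and drop
    examples whose cluster is no larger than the cluster at the knee."""
    sigs = [signature(s) for s in examples]
    counts = Counter(sigs)
    sizes = sorted(counts.values())
    total = sum(sizes)
    cdf, running = [], 0
    for size in sizes:
        running += size
        cdf.append(running / total)
    knee = knee_index(cdf)
    if knee is None:  # no knee found: treat every example as clean
        return list(examples)
    threshold = sizes[knee]
    return [s for s, g in zip(examples, sigs) if counts[g] > threshold]

def to_regex(sig):
    """Translate a signature into a regex using two illustrative templates."""
    parts = []
    for tok in sig:
        if tok == "D":
            parts.append(r"\d+")
        elif tok == "L":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(tok))
    return "".join(parts)

# Usage: 20 date-like strings plus three noisy entries.
dates = [f"2021-01-{d:02d}" for d in range(1, 21)]
noisy = dates + ["n/a", "TBD!", "12:30"]
kept = filter_noise(noisy)
pattern = to_regex(signature(kept[0]))
```

In this sketch the three noisy strings each form a singleton cluster, the CDF over cluster sizes bends sharply just before the large date cluster, and only the dates survive filtering. The choice to keep everything when no knee is found is a design simplification, not something the abstract specifies.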
Taxonomy
Topics: Software Testing and Debugging Techniques · Parallel Computing and Optimization Techniques · Topic Modeling
