Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder
Desheng Wang, Jiawei Liu, Xiang Qi, Baolin Sun, Peng Zhang

TL;DR
This paper introduces a novel genetic algorithm for automatic regex generation that leverages byte pair encoding to improve effectiveness and efficiency, demonstrating significant performance gains on diverse datasets.
Contribution
It proposes a new regex generation method combining BPE and multi-objective genetic algorithms, achieving faster training and better results than existing baselines.
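As a rough illustration of the BPE step, the sketch below learns frequent multi-character items from a handful of strings by repeatedly merging the most frequent adjacent pair. It is a generic byte pair encoding loop, not the authors' implementation; the function name `bpe_frequent_items`, the `num_merges` parameter, and the rule of stopping once no pair occurs at least twice are assumptions made for this example.

```python
from collections import Counter


def bpe_frequent_items(samples, num_merges=10):
    """Learn up to `num_merges` BPE merges and return the merged items in order.

    Hypothetical helper: a plain BPE loop, not the paper's exact procedure.
    """
    # Every sample starts out as a list of single characters.
    sequences = [list(s) for s in samples]
    learned = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across all samples.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            # Assumed stopping rule: a pair seen only once is not "frequent".
            break
        merged = left + right
        learned.append(merged)
        # Rewrite each sequence, replacing the chosen pair with the merged symbol.
        rewritten = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            rewritten.append(out)
        sequences = rewritten
    return learned


items = bpe_frequent_items(["order-123", "order-456", "order-789"])
print(items)
```

In this toy input the shared prefix is merged step by step until it forms a single item, which a regex generator could then treat as one building block instead of six separate characters.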
Findings
Outperforms baseline on 10 of 13 datasets.
Nearly 50% average improvement in accuracy.
Training speed increased by approximately 100 times with exponential decay.
Abstract
Regular expressions are important for many natural language processing tasks, especially when dealing with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for the problem. Unlike methods that generate regular expressions at the character level, we first use a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm combines multiple objectives and is optimized by an evolutionary procedure that includes crossover and mutation operations. In the fitness function, we take the length of the generated regular expression, the maximum matching characters and samples for positive training samples, and the minimum matching characters and samples for negative training samples into…
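The objectives named in the abstract can be sketched as a small scoring function. This is a minimal illustration under stated assumptions, not the paper's fitness function: the names `match_stats` and `fitness` are hypothetical, the objectives are returned as a plain tuple where larger is better, and how the paper weights or combines them is not stated in the text above.

```python
import re


def match_stats(pattern, samples):
    """Return (number of samples with a non-empty match, total matched characters)."""
    compiled = re.compile(pattern)
    hit_samples = 0
    hit_chars = 0
    for text in samples:
        matched = sum(m.end() - m.start() for m in compiled.finditer(text))
        if matched > 0:
            hit_samples += 1
        hit_chars += matched
    return hit_samples, hit_chars


def fitness(pattern, positives, negatives):
    """Objective vector for one candidate regex (assumed form, larger is better).

    Rewards matched samples and characters on positive examples, penalizes
    matched samples and characters on negative examples, and penalizes length.
    """
    pos_samples, pos_chars = match_stats(pattern, positives)
    neg_samples, neg_chars = match_stats(pattern, negatives)
    return (pos_samples, pos_chars, -neg_samples, -neg_chars, -len(pattern))


positives = ["order-123", "order-456"]
negatives = ["invoice 99"]
print(fitness(r"order-\d+", positives, negatives))
print(fitness(r"\d+", positives, negatives))
```

A genetic algorithm would evaluate such a vector for every candidate produced by crossover and mutation; whether candidates are then ranked by a weighted sum or by Pareto dominance is left open here.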
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Topic Modeling
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Exponential Decay
