Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions
James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee

TL;DR
This study investigates whether regular expressions are portable across programming languages. It finds that, despite similar syntax and widespread cross-language re-use, semantic and performance differences between regex engines can cause errors and vulnerabilities.
Contribution
The paper provides the first large-scale empirical analysis of regex portability issues across multiple languages, identifying semantic and performance discrepancies and uncovering engine bugs.
Findings
15% of regexes exhibit semantic differences across languages
10% of regexes exhibit performance differences across languages
Bugs were found in the JavaScript, Python, Ruby, and Rust regex engines
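To make "semantic differences" concrete, here is a minimal Python sketch of one well-known divergence. It is an illustration chosen for this summary and is not claimed to be an example taken from the paper. In Python, `$` also matches just before a single trailing newline, whereas in JavaScript (without the `m` flag) the same pattern rejects such input. The function names are hypothetical.

```python
import re


def accepts_dollar(line: str) -> bool:
    # In Python, `$` matches at the end of the string OR just before
    # a single trailing newline, so "123\n" is accepted.
    return re.search(r"^\d+$", line) is not None


def accepts_strict(line: str) -> bool:
    # `\Z` matches only at the very end of the string, so a trailing
    # newline causes the match to fail.
    return re.search(r"^\d+\Z", line) is not None


if __name__ == "__main__":
    print(accepts_dollar("123\n"))  # True
    print(accepts_strict("123\n"))  # False
```

A validator copied from JavaScript as `^\d+$` would therefore accept input with a trailing newline once it runs in Python, which is the kind of silent behavior change the findings describe.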
Abstract
This paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But can regexes be copy/pasted across language boundaries while retaining their semantic and performance characteristics? In our survey of 158 professional software developers, most indicated that they re-use regexes across language boundaries and about half reported that they believe regexes are a universal language. We experimentally evaluated the riskiness of this practice using a novel regex corpus -- 537,806 regexes from 193,524 projects written in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. Using our polyglot regex corpus, we explored the hitherto-unstudied regex portability problems: logic errors due to semantic…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Advanced Malware Detection Techniques
