Context-Free Grammar Inference for Complex Programming Languages in Black Box Settings
Feifei Li, Xiao Chen, Xiaoyu Sun, Xi Xiao, Shaohua Wang, Yong Ding, Sheng Wen, Qing Li

TL;DR
This paper introduces Crucio, a novel grammar inference method that efficiently infers complex programming language grammars in black box settings, outperforming existing tools in scalability and accuracy.
Contribution
Crucio employs a decomposition forest and distributional matrix to improve grammar inference for complex languages, overcoming limitations of prior approaches.
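The summary does not define the distributional matrix, so the following is only a minimal sketch of the general idea from distributional learning, not Crucio's actual construction. It assumes the matrix has one row per candidate substring and one column per (left, right) context, with each cell recording whether a black-box oracle accepts the context filled with that substring. The names `balanced`, `build_matrix`, and `group_by_row` are hypothetical, and a balanced-parentheses checker stands in for a real parser.

```python
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, str]  # (left, right) text surrounding a substring


def balanced(s: str) -> bool:
    """Toy black-box oracle (assumption): accepts balanced parentheses only."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False
        if depth < 0:
            return False
    return depth == 0


def build_matrix(
    substrings: List[str],
    contexts: List[Context],
    oracle: Callable[[str], bool],
) -> List[List[bool]]:
    """Cell [i][j] is True when the oracle accepts context j filled with substring i."""
    return [
        [oracle(left + sub + right) for (left, right) in contexts]
        for sub in substrings
    ]


def group_by_row(substrings: List[str], matrix: List[List[bool]]) -> List[List[str]]:
    """Group substrings whose rows are identical; each group is a candidate nonterminal."""
    groups: Dict[Tuple[bool, ...], List[str]] = {}
    for sub, row in zip(substrings, matrix):
        groups.setdefault(tuple(row), []).append(sub)
    return list(groups.values())


substrings = ["()", "(())", "(", ")"]
contexts = [("", ""), ("(", ")"), ("", ")"), ("(", "")]
matrix = build_matrix(substrings, contexts, balanced)
candidates = group_by_row(substrings, matrix)
```

In this toy setting, "()" and "(())" behave identically in every context and fall into one group, while "(" and ")" each stay separate. Substrings that are interchangeable under the oracle are the natural candidates for sharing a nonterminal.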
Findings
Crucio successfully infers grammars for complex languages with up to 23x more nonterminals.
It achieves 1.37x higher recall and 1.21x higher F1 scores than Treevada on prior benchmarks.
Crucio infers grammars within practical time limits, whereas Arvada, Treevada, and Kedavra failed to infer grammars for C, C++, and Java within 48 hours.
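The recall and F1 figures above can be read against the standard definitions. The sketch below assumes sampling-based estimates, a common setup in black-box grammar inference: recall is the fraction of held-out valid programs that the inferred grammar accepts, and precision is the fraction of strings sampled from the inferred grammar that the oracle accepts. The function names and the toy inputs are hypothetical; the summary does not state how the paper computes its metrics.

```python
from typing import Callable, Iterable


def fraction_accepted(samples: Iterable[str], accepts: Callable[[str], bool]) -> float:
    """Fraction of samples for which `accepts` returns True (0.0 for no samples)."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(1 for s in samples if accepts(s)) / len(samples)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins (assumptions): the "oracle" accepts strings of a's only,
# while the "inferred grammar" accepts any string containing no 'c'.
oracle = lambda s: set(s) <= {"a"}
inferred = lambda s: "c" not in s

held_out_valid = ["a", "aa", "aaa", "aaaa"]       # valid programs not used for inference
sampled_from_inferred = ["a", "ab", "b", "aa"]    # strings generated by the inferred grammar

recall = fraction_accepted(held_out_valid, inferred)
precision = fraction_accepted(sampled_from_inferred, oracle)
f1 = f1_score(precision, recall)
```

Because F1 is a harmonic mean, an overgeneralized grammar with perfect recall but low precision still scores poorly, which is why recall and F1 are reported together.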
Abstract
Grammar inference for complex programming languages remains a significant challenge, as existing approaches fail to scale to real-world datasets within practical time constraints. In our experiments, none of the state-of-the-art tools, including Arvada, Treevada, and Kedavra, were able to infer grammars for complex languages such as C, C++, and Java within 48 hours. Arvada and Treevada perform grammar inference directly on full-length input examples, which proves inefficient for the large files commonly found in such languages. While Kedavra introduces data decomposition to create shorter examples for grammar inference, its lexical analysis still relies on the original inputs. Additionally, its strict no-overgeneralization constraint limits the construction of complex grammars. To overcome these limitations, we propose Crucio, which builds a decomposition forest to extract short examples…
Taxonomy
Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · Topic Modeling
