Learning Highly Recursive Input Grammars

Neil Kulkarni; Caroline Lemieux; Koushik Sen

arXiv:2108.13340 · cs.SE · August 31, 2021

TL;DR

This paper introduces Arvada, an algorithm that learns context-free grammars from positive examples and a Boolean-valued oracle. It captures recursive structure and outperforms GLADE, a previous method, in recall and F1 score.

Contribution

Arvada's key operation is bubbling: grouping a sequence of sibling nodes in a parse tree under a new node. Bubbling enables recursive generalization in the learned grammar, and the improvement over GLADE is most marked for highly recursive languages.

Findings

01

On average, Arvada achieves a 4.98x increase in recall over GLADE.

02

On average, Arvada achieves a 3.13x increase in F1 score over GLADE.

03

Arvada needs fewer oracle calls: on average, only 0.87x as many as GLADE.

Abstract

This paper presents Arvada, an algorithm for learning context-free grammars from a set of positive examples and a Boolean-valued oracle. Arvada learns a context-free grammar by building parse trees from the positive examples. Starting from initially flat trees, Arvada adds structure to these trees with a key operation: it bubbles sequences of sibling nodes in the trees into a new node, adding a layer of indirection to the tree. Bubbling operations enable recursive generalization in the learned grammar. We evaluate Arvada against GLADE and find it achieves on average increases of 4.98x in recall and 3.13x in F1 score, while incurring only a 1.27x slowdown and requiring only 0.87x as many calls to the oracle. Arvada has a particularly marked improvement over GLADE on grammars with highly recursive structure, like those of programming languages.
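The bubbling step described in the abstract can be illustrated with a small sketch. This is not Arvada's implementation: the `Node` class, the `flat_tree` and `bubble` functions, and the labels `t0` and `t1` are hypothetical names chosen for illustration. The sketch assumes each positive example starts as a flat tree with one leaf per character. It shows only the tree transformation itself, not how the oracle is used to decide which bubbles to keep.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A parse-tree node; a node with no children is a leaf holding a terminal."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def yield_str(self) -> str:
        """Concatenate the leaves under this node, left to right."""
        if not self.children:
            return self.label
        return "".join(child.yield_str() for child in self.children)


def flat_tree(example: str, root_label: str = "t0") -> Node:
    """Build an initial flat tree: one root with one leaf per character."""
    return Node(root_label, [Node(ch) for ch in example])


def bubble(parent: Node, start: int, end: int, new_label: str) -> Node:
    """Replace the siblings parent.children[start:end] with one new node that
    has those siblings as its children (one added layer of indirection)."""
    if not (0 <= start < end <= len(parent.children)):
        raise ValueError("bubble range must be a non-empty slice of the children")
    bubbled = Node(new_label, parent.children[start:end])
    parent.children[start:end] = [bubbled]
    return bubbled


# Hypothetical example input: a parenthesized arithmetic expression.
tree = flat_tree("(1+2)")
inner = bubble(tree, 1, 4, "t1")  # group the siblings '1', '+', '2' under t1
```

After the call, the root has three children, `(`, `t1`, and `)`, and the string the tree derives is unchanged. The new node `t1` is the kind of intermediate node that a learned grammar could later reuse inside itself, which is where recursive generalization would come from.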

Code & Models

Repositories

neil-kulkarni/arvada (Official)

Taxonomy

Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · Software Testing and Debugging Techniques