Active Learning of Input Grammars
Matthias Höschele, Alexander Kampmann, Andreas Zeller

TL;DR
This paper introduces a method that automatically learns human-readable input grammars from a small set of sample inputs by tracking data flow through the program and generalizing production rules with membership queries, which supports systematic program testing.
Contribution
It presents a novel approach combining data flow analysis and membership queries to infer accurate, readable context-free grammars from limited input samples.
Findings
Generated grammars are accurate and human-readable.
The approach requires only a minimal set of sample inputs.
Grammars can be directly used for automated testing.
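To make the last finding concrete, here is a minimal sketch of how a context-free grammar can drive a test generator. The toy URL grammar and the `generate` helper are assumptions made for illustration; they are neither output of AUTOGRAM nor part of its interface.

```python
import random

# Hypothetical toy grammar in a dict-of-alternatives form (an assumption,
# not a grammar produced by AUTOGRAM). Keys are nonterminals; every other
# string is treated as a terminal.
GRAMMAR = {
    "<url>": [["<scheme>", "://", "<host>", "<path>"]],
    "<scheme>": [["http"], ["https"], ["ftp"]],
    "<host>": [["example.com"], ["localhost"]],
    "<path>": [[""], ["/", "<segment>"], ["/", "<segment>", "<path>"]],
    "<segment>": [["a"], ["b"], ["index"]],
}


def generate(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` into a random string derived from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative so that
        # recursive rules such as <path> are forced to terminate.
        alternatives = [min(alternatives, key=len)]
    expansion = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in expansion)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(generate("<url>", rng))
```

Each generated string can then be passed to the program under test; a seeded `random.Random` keeps the generated inputs reproducible.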
Abstract
Knowing the precise format of a program's input is a necessary prerequisite for systematic testing. Given a program and a small set of sample inputs, we (1) track the data flow of inputs to aggregate input fragments that share the same data flow through program execution into lexical and syntactic entities; (2) assign these entities names that are based on the associated variable and function identifiers; and (3) systematically generalize production rules by means of membership queries. As a result, we need only a minimal set of sample inputs to obtain human-readable context-free grammars that reflect valid input structure. In our evaluation on inputs like URLs, spreadsheets, or configuration files, our AUTOGRAM prototype obtains input grammars that are both accurate and very readable - and that can be directly fed into test generators for comprehensive automated testing.
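Step (3), generalization by membership queries, can be illustrated with a simplified sketch: replace a fragment of a known-valid sample with candidate characters and ask the program whether the result is still accepted. Everything below is an assumption made for illustration, including the toy `accepts` program, the `CHAR_CLASSES` table, and the `generalize_fragment` helper; it is not the authors' implementation and omits the data flow tracking of steps (1) and (2).

```python
import string


def accepts(text):
    """Toy program under test (hypothetical): accepts 'host:port'
    where port is an ASCII decimal number below 65536."""
    host, sep, port = text.partition(":")
    return (
        bool(host)
        and sep == ":"
        and port.isascii()
        and port.isdigit()
        and int(port) < 65536
    )


# Candidate character classes to try as generalizations (an assumption).
CHAR_CLASSES = {
    "digit": string.digits,
    "letter": string.ascii_letters,
    "punct": string.punctuation,
}


def generalize_fragment(sample, start, end, oracle):
    """Return the names of the character classes whose every member can
    replace sample[start:end] while the oracle still accepts the input.
    Each call to `oracle` is one membership query."""
    accepted = []
    for name, chars in CHAR_CLASSES.items():
        if all(oracle(sample[:start] + c + sample[end:]) for c in chars):
            accepted.append(name)
    return accepted


if __name__ == "__main__":
    sample = "localhost:80"
    # First character of the port.
    print(generalize_fragment(sample, 10, 11, accepts))
    # First character of the host.
    print(generalize_fragment(sample, 0, 1, accepts))
```

In this toy setting the port position generalizes only to digits, while the host position also accepts letters; a rule learned from the single sample `localhost:80` could be widened accordingly without further sample inputs.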
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research
