Inferring Input Grammars from Dynamic Control Flow
Rahul Gopinath, Björn Mathis, Andreas Zeller

TL;DR
This paper introduces a general algorithm that infers human-readable context-free grammars from programs by analyzing input character access patterns, applicable to various recursive descent parsers without heuristics.
Contribution
The authors present a novel, heuristic-free method to automatically derive readable input grammars from program execution traces, applicable to stack-based recursive descent parsers.
Findings
Produced accurate grammars for expr, URLparse, and microJSON
Works without heuristics on all stack-based recursive descent parsers
Generates readable grammars from small sample inputs
Abstract
A program is characterized by its input model, and a formal input model can be of use in diverse areas including vulnerability analysis, reverse engineering, fuzzing and software testing, clone detection and refactoring. Unfortunately, input models for typical programs are often unavailable or out of date. While there exist algorithms that can mine the syntactical structure of program inputs, they either produce unwieldy and incomprehensible grammars, or require heuristics that target specific parsing patterns. In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program. We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all program stack based recursive…
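The core idea in the abstract, inferring structure by observing which part of the parser accesses each input character, can be illustrated with a minimal sketch. This is not the authors' implementation: the `TracedInput` wrapper, the `traced` decorator, the toy `expr`/`term`/`number` parser, and `mine_grammar` are all hypothetical names introduced here. The sketch tracks only function calls, and attributing each character to the call stack at its last access is an assumption of this sketch.

```python
class TracedInput:
    """Wraps an input string and logs which parser call reads each character."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.stack = []    # active parser calls as (name, call_id)
        self.owner = {}    # char index -> call stack at the LAST access
        self.next_id = 0

    def peek(self):
        if self.pos < len(self.text):
            self.owner[self.pos] = tuple(self.stack)
            return self.text[self.pos]
        return None

    def advance(self):
        self.pos += 1


def traced(fn):
    """Pushes a uniquely numbered call record while fn is running."""
    def wrapper(inp):
        inp.stack.append((fn.__name__, inp.next_id))
        inp.next_id += 1
        try:
            return fn(inp)
        finally:
            inp.stack.pop()
    return wrapper


# A toy recursive descent parser (hypothetical subject program).
@traced
def expr(inp):
    term(inp)
    while inp.peek() == '+':
        inp.advance()
        term(inp)

@traced
def term(inp):
    if inp.peek() == '(':
        inp.advance()
        expr(inp)
        assert inp.peek() == ')'
        inp.advance()
    else:
        number(inp)

@traced
def number(inp):
    c = inp.peek()
    assert c is not None and c.isdigit()
    while c is not None and c.isdigit():
        inp.advance()
        c = inp.peek()


def mine_grammar(text, start):
    """Runs the parser on text and turns the access log into grammar rules."""
    inp = TracedInput(text)
    start(inp)
    children = {}  # call -> ordered list of child calls and terminal chars
    for i in sorted(inp.owner):
        parent = None
        for call in inp.owner[i]:
            if call not in children:
                children[call] = []
                if parent is not None:
                    children[parent].append(call)
            parent = call
        children[parent].append(text[i])
    grammar = {}
    for (name, _), kids in children.items():
        rule = tuple('<%s>' % k[0] if isinstance(k, tuple) else k for k in kids)
        grammar.setdefault('<%s>' % name, set()).add(rule)
    return grammar
```

Calling `mine_grammar('1+2', expr)` yields one rule per nonterminal observed in that run, for example `<expr>` expanding to `<term> + <term>`. Merging the rule sets from several sample inputs gives a larger grammar, but this sketch performs no generalization, so `<number>` only lists the literal digit sequences it has seen.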
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Advanced Malware Detection Techniques
