Saggitarius: A DSL for Specifying Grammatical Domains
Anders Miltner, Devon Loehr, Arnold Mong, Kathleen Fisher, and David Walker

TL;DR
Saggitarius is a new language and system that enables reasoning about data formats by describing sets of context-free grammars, facilitating analysis and inference of data representations from examples.
Contribution
The paper introduces a novel language and an algorithm for inferring data grammars from examples, with applications in data validation and dialect detection.
Findings
Typically infers a satisfying grammar within a few seconds
Achieves 84% success rate in CSV dialect detection within 60 seconds
Offers comparable accuracy to specialized tools for data format inference
Abstract
Common data types like dates, addresses, phone numbers and tables can have multiple textual representations, and many heavily-used languages, such as SQL, come in several dialects. These variations can cause data to be misinterpreted, leading to silent data corruption, failure of data processing systems, or even security vulnerabilities. Saggitarius is a new language and system designed to help programmers reason about the format of data, by describing grammatical domains -- that is, sets of context-free grammars that describe the many possible representations of a datatype. We describe the design of Saggitarius via example and provide a relational semantics. We show how Saggitarius may be used to analyze a data set: given example data, it uses an algorithm based on semi-ring parsing and MaxSAT to infer which grammar in a given domain best matches that data. We evaluate the…
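To make the idea of a grammatical domain concrete, here is a minimal sketch in Python. It is not Saggitarius syntax and not the paper's algorithm: regular expressions stand in for context-free grammars, and brute-force enumeration stands in for the semi-ring parsing and MaxSAT approach the abstract mentions. The date domain and the names FIELDS, ORDERS, SEPARATORS, candidate_grammars, and consistent_grammars are all assumptions made for illustration.

```python
import itertools
import re

# Hypothetical toy domain: each choice of field order and separator
# selects one candidate "grammar" for a date. Regexes stand in for
# context-free grammars here.
FIELDS = {
    "D": r"(0[1-9]|[12][0-9]|3[01])",  # day 01-31
    "M": r"(0[1-9]|1[0-2])",           # month 01-12
    "Y": r"[0-9]{4}",                  # four-digit year
}
ORDERS = ["DMY", "MDY", "YMD"]
SEPARATORS = ["/", "-", "."]


def candidate_grammars():
    """Yield (name, compiled pattern) for every grammar in the toy domain."""
    for order, sep in itertools.product(ORDERS, SEPARATORS):
        body = re.escape(sep).join(FIELDS[field] for field in order)
        yield f"{order} with '{sep}'", re.compile(body)


def consistent_grammars(positives, negatives=()):
    """Return the names of grammars that accept every positive example
    and reject every negative example (exhaustive search, not MaxSAT)."""
    matches = []
    for name, pattern in candidate_grammars():
        accepts_all = all(pattern.fullmatch(s) for s in positives)
        rejects_all = not any(pattern.fullmatch(s) for s in negatives)
        if accepts_all and rejects_all:
            matches.append(name)
    return matches
```

In this sketch, the single example "01/02/2020" leaves two candidates (day-first and month-first), while adding "25/12/2021" as a second positive example rules out the month-first reading. Exhaustive search only works because the toy domain has nine grammars; a realistic domain is far too large to enumerate, which is presumably why the paper turns to a solver-based approach.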
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Text Readability and Simplification
