Statistical detection of format dialects using the weighted Dowker complex
Michael Robinson, Letitia W. Li, Cory Anderson, Steve Huntsman

TL;DR
This paper introduces a probabilistic model, built on the weighted Dowker complex, for detecting format dialects in files. The model supports two-dialect classification and dialect boundary detection, and can be trained from a set consisting mostly of one dialect.
Contribution
It presents a novel probabilistic framework for format dialect detection based on Boolean message patterns and the weighted Dowker complex, with theoretical and empirical validation.
Findings
The posterior-thresholding classifier outperforms classification based on simply counting messages.
The two-dialect classifier can be bootstrapped from a training set consisting primarily of one dialect.
Violations of message independence are detectable, aiding dialect boundary identification.
Abstract
This paper provides an experimentally validated, probabilistic model of file behavior when consumed by a set of pre-existing parsers. File behavior is measured by way of a standardized set of Boolean "messages" produced as the files are read. By thresholding the posterior probability that a file exhibiting a particular set of messages is from a particular dialect, our model yields a practical classification algorithm for two dialects. We demonstrate that this thresholding algorithm for two dialects can be bootstrapped from a training set consisting primarily of one dialect. Both the (parametric) theoretical and the (non-parametric) empirical distributions of file behaviors for one dialect yield good classification performance, and outperform classification based on simply counting messages. Our theoretical framework relies on statistical independence of messages within each dialect.…
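As a rough illustration of the thresholding idea in the abstract, the sketch below computes the posterior probability of one dialect from a Boolean message vector, assuming messages are independent within each dialect. It is a plain Bayes-rule calculation, not the paper's weighted Dowker complex implementation. The names `posterior_dialect_a` and `classify`, the per-message probabilities, the prior, and the threshold are all hypothetical values chosen for the example.

```python
import math

# Hypothetical per-message firing probabilities for two dialects.
# These numbers are invented for illustration; they are not from the paper.
PROBS_A = [0.9, 0.1, 0.2]
PROBS_B = [0.2, 0.8, 0.3]


def log_likelihood(messages, probs):
    """Log-probability of a Boolean message vector, assuming each
    message fires independently with the given probability."""
    return sum(math.log(p if m else 1.0 - p) for m, p in zip(messages, probs))


def posterior_dialect_a(messages, probs_a, probs_b, prior_a=0.5):
    """Posterior probability that the file belongs to dialect A (Bayes' rule)."""
    log_a = math.log(prior_a) + log_likelihood(messages, probs_a)
    log_b = math.log(1.0 - prior_a) + log_likelihood(messages, probs_b)
    # Subtract the larger log term before exponentiating to avoid underflow.
    shift = max(log_a, log_b)
    weight_a = math.exp(log_a - shift)
    weight_b = math.exp(log_b - shift)
    return weight_a / (weight_a + weight_b)


def classify(messages, probs_a, probs_b, prior_a=0.5, threshold=0.5):
    """Label a file 'A' when the posterior for dialect A reaches the threshold."""
    posterior = posterior_dialect_a(messages, probs_a, probs_b, prior_a)
    return "A" if posterior >= threshold else "B"


if __name__ == "__main__":
    for observed in ([True, False, False], [False, True, False]):
        p = posterior_dialect_a(observed, PROBS_A, PROBS_B)
        print(observed, round(p, 4), classify(observed, PROBS_A, PROBS_B))
```

Unlike a simple message count, this calculation weighs which messages fired: the two example files each trigger exactly one message, yet they receive opposite labels.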
Taxonomy
Topics: Algorithms and Data Compression · Authorship Attribution and Profiling · Natural Language Processing Techniques
