Statistical detection of format dialects using the weighted Dowker complex

Michael Robinson; Letitia W. Li; Cory Anderson; Steve Huntsman

arXiv:2201.08267 · cs.CE · September 23, 2022 · 1 citation

TL;DR

This paper introduces a probabilistic model using the weighted Dowker complex to detect format dialects in files, enabling effective classification and boundary detection with minimal training data.

Contribution

It presents a novel probabilistic framework for format dialect detection based on Boolean message patterns and the weighted Dowker complex, with theoretical and empirical validation.

Findings

1. The classification algorithm outperforms message counting methods.

2. The model can be bootstrapped from predominantly one dialect.

3. Violations of message independence are detectable, aiding dialect boundary identification.

Abstract

This paper provides an experimentally validated, probabilistic model of file behavior when consumed by a set of pre-existing parsers. File behavior is measured by way of a standardized set of Boolean "messages" produced as the files are read. By thresholding the posterior probability that a file exhibiting a particular set of messages is from a particular dialect, our model yields a practical classification algorithm for two dialects. We demonstrate that this thresholding algorithm for two dialects can be bootstrapped from a training set consisting primarily of one dialect. Both the (parametric) theoretical and the (non-parametric) empirical distributions of file behaviors for one dialect yield good classification performance, and outperform classification based on simply counting messages. Our theoretical framework relies on statistical independence of messages within each dialect.…
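
The thresholding idea in the abstract can be illustrated with a minimal sketch. It assumes that messages are statistically independent within each dialect (as the abstract states) and that per-message firing probabilities for each dialect are already known. The function names, the probability values, the prior, and the 0.5 threshold are all hypothetical choices for illustration and are not taken from the paper.

```python
import math


def log_likelihood(messages, probs):
    """Log-probability of a Boolean message vector under one dialect.

    Assumes messages are independent, and that every entry of `probs`
    lies strictly between 0 and 1 (hypothetical per-message rates).
    """
    total = 0.0
    for fired, p in zip(messages, probs):
        total += math.log(p if fired else 1.0 - p)
    return total


def posterior_dialect_a(messages, probs_a, probs_b, prior_a=0.5):
    """Posterior probability that a file is from dialect A (Bayes' rule)."""
    log_a = math.log(prior_a) + log_likelihood(messages, probs_a)
    log_b = math.log(1.0 - prior_a) + log_likelihood(messages, probs_b)
    # Subtract the larger term before exponentiating for numerical stability.
    top = max(log_a, log_b)
    weight_a = math.exp(log_a - top)
    weight_b = math.exp(log_b - top)
    return weight_a / (weight_a + weight_b)


def classify(messages, probs_a, probs_b, prior_a=0.5, threshold=0.5):
    """Label a file 'A' when the posterior for dialect A meets the threshold."""
    posterior = posterior_dialect_a(messages, probs_a, probs_b, prior_a)
    return "A" if posterior >= threshold else "B"


# Illustrative (made-up) probabilities for three Boolean messages.
PROBS_A = [0.9, 0.1, 0.5]
PROBS_B = [0.2, 0.7, 0.5]

if __name__ == "__main__":
    for observed in ([True, False, True], [False, True, False]):
        print(observed, classify(observed, PROBS_A, PROBS_B))
```

Raising the threshold trades missed dialect-A files for fewer false assignments to A; the sketch leaves that choice to the caller rather than suggesting a value.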

Taxonomy

Topics: Algorithms and Data Compression · Authorship Attribution and Profiling · Natural Language Processing Techniques