Looking for non-compliant documents using error messages from multiple parsers
Michael Robinson

TL;DR
This paper proposes a statistical approach using error messages from multiple parsers to more reliably identify non-compliant files, improving security and robustness without relying on formal format specifications.
Contribution
It introduces a format-agnostic method based on pseudo-likelihood ratio tests and principal components analysis to detect non-compliance and assess format variability.
Findings
Effective detection of non-compliant files using multiple parser error messages
Format variability measurement through principal components analysis
Method is format-agnostic and does not depend on formal specifications
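To make the variability finding concrete, here is a minimal sketch of one way principal components analysis could be applied to parser output. This summary does not say how the paper encodes its data or which PCA statistic it reports, so everything below is an assumption: each file is a 0/1 vector recording which error messages fired, and the share of variance captured by the first principal component serves as a rough variability indicator. The function names and toy data are hypothetical.

```python
import math

def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenvalue(cov, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    d = len(cov)
    # Asymmetric start vector, so it is unlikely to be orthogonal
    # to the leading eigenvector (an assumption, not a guarantee).
    v = [float(j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit-length) converged vector.
    return sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))

def first_component_share(rows):
    """Fraction of total variance explained by the first principal component."""
    cov = covariance(rows)
    total = sum(cov[j][j] for j in range(len(cov)))
    return 0.0 if total == 0.0 else top_eigenvalue(cov) / total

# Hypothetical data: rows are files, columns are error messages (1 = fired).
tight = [[1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]    # messages 0 and 1 co-occur
spread = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]   # messages fire independently

share_tight = first_component_share(tight)
share_spread = first_component_share(spread)
```

Under this reading, a share near 1 means the files differ along a single direction, while a lower share suggests the error behaviour varies in several independent ways.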
Abstract
Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this…
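The ratio test described in the abstract can be sketched as follows. The paper's exact pseudo-likelihood is not given in this summary, so the version below is an assumption: error messages are treated as independent indicators, per-message firing rates are estimated from each labelled sample with add-one smoothing, and a file is scored by the log ratio of the two resulting pseudo-likelihoods. All function names and the toy data are hypothetical.

```python
import math

def message_rates(samples, n_messages, alpha=1.0):
    """Smoothed probability that each message fires, estimated from a sample.

    Each element of `samples` is the set of message indices seen for one file,
    pooled across all parsers.
    """
    counts = [0] * n_messages
    for fired in samples:
        for j in fired:
            counts[j] += 1
    n = len(samples)
    # Add-alpha smoothing keeps every rate strictly between 0 and 1.
    return [(c + alpha) / (n + 2 * alpha) for c in counts]

def log_pseudo_likelihood(fired, rates):
    """Log-probability of a message set, assuming messages are independent."""
    return sum(math.log(p) if j in fired else math.log(1.0 - p)
               for j, p in enumerate(rates))

def log_ratio(fired, rates_non_compliant, rates_compliant):
    """Positive values favour non-compliance, negative values favour compliance."""
    return (log_pseudo_likelihood(fired, rates_non_compliant)
            - log_pseudo_likelihood(fired, rates_compliant))

# Hypothetical training samples over three distinct error messages.
compliant = [set(), {0}, set(), {0}]
non_compliant = [{1, 2}, {2}, {0, 1, 2}, {1}]

rates_good = message_rates(compliant, 3)
rates_bad = message_rates(non_compliant, 3)

score_quiet = log_ratio(set(), rates_bad, rates_good)    # file with no messages
score_noisy = log_ratio({1, 2}, rates_bad, rates_good)   # file with messages 1 and 2
```

In this toy setup a file that triggers no messages scores below zero and a file that triggers messages 1 and 2 scores above zero. A decision threshold would still have to be chosen from the data, for example to cap the false-alarm rate on the compliant sample.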
