Looking for non-compliant documents using error messages from multiple parsers

Michael Robinson

arXiv:2012.10211·cs.OH·December 21, 2020

TL;DR

This paper proposes a statistical approach using error messages from multiple parsers to more reliably identify non-compliant files, improving security and robustness without relying on formal format specifications.

Contribution

It introduces a format-agnostic method based on pseudo-likelihood ratio tests and principal components analysis to detect non-compliance and assess format variability.

Findings

01

Effective detection of non-compliant files using multiple parser error messages

02

Format variability measurement through principal components analysis

03

Method is format-agnostic and does not depend on formal specifications
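The listing does not say how principal components analysis is applied, so the following is only one plausible reading of the second finding, not the paper's procedure. It assumes each file is encoded as a 0/1 row recording which parser messages it triggered, and it reports the share of variance carried by each principal component. The function name `explained_variance_ratios` and the toy matrices are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def explained_variance_ratios(message_matrix: np.ndarray) -> np.ndarray:
    """Share of total variance carried by each principal component.

    message_matrix is an assumed encoding: one row per file, one column per
    (parser, message) pair, with 1 where the message was emitted and 0 otherwise.
    """
    centered = message_matrix - message_matrix.mean(axis=0)
    # Singular values of the centered matrix give the component variances
    # up to a constant factor, which cancels in the ratio below.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0:
        # Every file produced the same messages: there is no variability.
        return np.zeros_like(variances)
    return variances / total


# Toy data (hypothetical): two messages that always fire together, one that never fires.
correlated = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]])
# Toy data (hypothetical): two messages that fire independently of each other.
independent = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

ratios_correlated = explained_variance_ratios(correlated)
ratios_independent = explained_variance_ratios(independent)
```

Under this reading, variance concentrated in a single component means the files differ along one direction only, while an even spread across components means the messages vary independently.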

Abstract

Whether a file is accepted by a single parser is not a reliable indication of whether a file complies with its stated format. Bugs within both the parser and the format specification mean that a compliant file may fail to parse, or that a non-compliant file might be read without any apparent trouble. The latter situation presents a significant security risk, and should be avoided. This article suggests that a better way to assess format specification compliance is to examine the set of error messages produced by a set of parsers rather than a single parser. If both a sample of compliant files and a sample of non-compliant files are available, then we show how a statistical test based on a pseudo-likelihood ratio can be very effective at determining a file's compliance. Our method is format agnostic, and does not directly rely upon a formal specification of the format. Although this…
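The abstract does not give the test statistic, so the following is a minimal sketch of one way a pseudo-likelihood ratio over parser messages could be computed, not the paper's definition. It assumes each file is summarized by the set of (parser, message) pairs it triggered, treats the messages as independent (the "pseudo" simplification assumed here), and estimates per-message rates from the two labeled samples with add-one smoothing. All function names, parser names, and message strings are hypothetical.

```python
import math
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical encoding: a message is a (parser_name, error_text) pair.
Message = Tuple[str, str]


def message_rates(files: List[FrozenSet[Message]],
                  vocabulary: Iterable[Message]) -> Dict[Message, float]:
    """Smoothed fraction of files in the sample that triggered each message."""
    n = len(files)
    # Add-one smoothing keeps every rate strictly between 0 and 1.
    return {m: (sum(m in f for f in files) + 1.0) / (n + 2.0) for m in vocabulary}


def log_pseudo_likelihood_ratio(observed: FrozenSet[Message],
                                rates_bad: Dict[Message, float],
                                rates_good: Dict[Message, float]) -> float:
    """Log ratio of non-compliant to compliant pseudo-likelihood.

    Messages are treated as independent, which is an assumption of this
    sketch. Messages outside the training vocabulary are ignored.
    """
    total = 0.0
    for m, p_bad in rates_bad.items():
        p_good = rates_good[m]
        if m in observed:
            total += math.log(p_bad / p_good)
        else:
            total += math.log((1.0 - p_bad) / (1.0 - p_good))
    return total


# Toy labeled samples (hypothetical parsers and messages).
XREF = ("parser_a", "error: bad xref")
SYNTAX = ("parser_b", "syntax error")
UNUSED = ("parser_a", "warning: unused object")

compliant = [frozenset(), frozenset({UNUSED}), frozenset()]
non_compliant = [frozenset({XREF, SYNTAX}), frozenset({SYNTAX}), frozenset({XREF})]

vocabulary = {XREF, SYNTAX, UNUSED}
rates_good = message_rates(compliant, vocabulary)
rates_bad = message_rates(non_compliant, vocabulary)

score_suspicious = log_pseudo_likelihood_ratio(frozenset({SYNTAX}), rates_bad, rates_good)
score_clean = log_pseudo_likelihood_ratio(frozenset(), rates_bad, rates_good)
```

In this sketch a positive score favors non-compliance and a negative score favors compliance. The zero threshold is a placeholder; in practice it would be tuned to the desired false-alarm rate.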
