Unshuffling fields in data formats

Steve Huntsman

arXiv:1910.09964·cs.IT·March 3, 2020

Open Access

TL;DR

This paper introduces a formal approach to unshuffling data format chunks with common length statistics, providing probabilistic bounds and algorithms to aid reverse engineering of complex data formats.

Contribution

The paper formalizes the unshuffling problem, derives probabilistic bounds for it, and outlines corresponding algorithms for use in data format reverse engineering.

Findings

01

The unshuffling problem can be stated formally, and probabilistic bounds can be derived for it.

02

Unshuffling algorithms are outlined and demonstrated empirically on chunked data formats.

03

Unshuffling is connected to the related class of synchronization problems.

Abstract

Data format reverse engineering commonly involves identifying conserved format motifs. However, this process typically requires establishing a common ordering for format elements across instances, particularly for formats using type-(length)-value tuples or "chunk" encoding. It is useful to unshuffle chunks with common length statistics as a precursor to identifying conserved internal structures. We formalize the unshuffling problem and subsequently derive probabilistic bounds and outline corresponding algorithms for it. We empirically demonstrate unshuffling and highlight connections with the related class of synchronization problems.
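
To make the idea of unshuffling by length concrete, here is a toy sketch in Python. It is not the paper's algorithm, only an illustration under strong assumptions: each instance contains the same number of chunks, each chunk type has a characteristic length, and a brute-force search over permutations is affordable. The function name `unshuffle_to_reference` and the example lengths are hypothetical.

```python
from itertools import permutations


def unshuffle_to_reference(reference_lengths, shuffled_lengths):
    """Match shuffled chunks to reference slots by chunk length.

    Returns a list `perm` where `shuffled_lengths[perm[i]]` is the chunk
    assigned to reference slot `i`. The assignment minimizes the total
    absolute difference in length. Brute force: only suitable for a
    handful of chunks.
    """
    n = len(reference_lengths)
    if len(shuffled_lengths) != n:
        raise ValueError("both instances must contain the same number of chunks")

    best_perm = None
    best_cost = None
    for perm in permutations(range(n)):
        cost = sum(
            abs(reference_lengths[i] - shuffled_lengths[perm[i]])
            for i in range(n)
        )
        if best_cost is None or cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)


# Hypothetical chunk lengths (in bytes) for two instances of one format.
reference = [16, 250, 4000, 64]
instance = [3900, 18, 70, 245]

perm = unshuffle_to_reference(reference, instance)
aligned = [instance[j] for j in perm]
print(perm)     # index of the instance chunk matched to each reference slot
print(aligned)  # instance chunk lengths reordered to the reference order
```

For more than a few chunks, the exhaustive search would have to be replaced by an assignment solver, and lengths alone may not separate chunk types whose length distributions overlap.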

Taxonomy

Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms